Data Analytics and Machine Learning - Individual Assignment¶

This notebook explores the books dataset obtained by scraping BooksMandala.

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly.subplots import make_subplots

from wordcloud import WordCloud, STOPWORDS

from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer

from sentence_transformers import SentenceTransformer

from scipy.sparse import hstack, csr_matrix

import umap

import ast
import re
from tqdm.autonotebook import tqdm, trange

Dataset Loading¶

In [2]:
filepath: str = "/home/am/booksmandala-data-analytics/notebooks/data/dataset.csv"
df = pd.read_csv(filepath)
df.head()
Out[2]:
Title Author Price Rating Limited Stock Discount Genre Number of Pages Weight ISBN Language Related Genres Subgenres Synopsis URL
0 The Gruffalo by Julia Donaldson Rs. 720 NaN Only 3 item left in stock! NaN Arts And Photography 33 Pages 196g 9781509804757 English Kids and Teens, Arts and Photography, Kids and... Ages 3 to 5\n, Picture Books\n, Ages 3 to 5, P... A mouse took a stroll through the deep dark wo... https://booksmandala.com/books/the-gruffalo-12894
1 Tibetan Pilgrimage :Architecture of the Sacred... by Michel Peisel Rs. 1200 NaN NaN NaN Arts And Photography NaN 1050g 9780810959446 English Arts and Photography, Miscellaneous, Arts and ... Architecture\n, Books on Tibet\n, Architecture... With nearly a hundred exceptional watercolor i... https://booksmandala.com/books/tibetan-pilgrim...
2 The Sacred Mountain by Dalai Lama Xiv Bstan-ʼDzin-Rgya-Mtsho and J... Rs. 1592 NaN NaN NaN Arts And Photography 457 Pages 970g 9788120831520 English Travel, Arts and Photography, Travel, Arts and... Climbing and Mountaineering\n, Picture Books\n... (4) Truth of the path leading to the annihilat... https://booksmandala.com/books/the-sacred-moun...
3 The Inner Game of Music by Barry Green and W. Timothy Gallwey Rs. 1040 NaN Only 6 item left in stock! NaN Arts And Photography 248 Pages 200g 9781447291725 English Arts and Photography, Self Improvement and Rel... Music\n, Self Help\n, Psychology\n, Music, Sel... The bestselling guide to improving musical per... https://booksmandala.com/books/the-inner-game-...
4 Hooked: How to Build Habit-Forming Products by Nir Eyal and Ryan Hoover Rs. 1118 NaN NaN NaN Arts And Photography 242 Pages 340g 9780241184837 English Business and Investing, Arts and Photography, ... Business\n, Design\n, Psychology\n, Self Help\... How do successful companies create products pe... https://booksmandala.com/books/hooked-how-to-b...
In [3]:
df.describe()
Out[3]:
           Rating
count  252.000000
mean     4.440873
std      0.813651
min      1.000000
25%      4.000000
50%      5.000000
75%      5.000000
max      5.000000
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2840 entries, 0 to 2839
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Title            2840 non-null   object 
 1   Author           2840 non-null   object 
 2   Price            2840 non-null   object 
 3   Rating           252 non-null    float64
 4   Limited Stock    1729 non-null   object 
 5   Discount         41 non-null     object 
 6   Genre            2840 non-null   object 
 7   Number of Pages  2640 non-null   object 
 8   Weight           2840 non-null   object 
 9   ISBN             2840 non-null   object 
 10  Language         2840 non-null   object 
 11  Related Genres   2840 non-null   object 
 12  Subgenres        2655 non-null   object 
 13  Synopsis         2835 non-null   object 
 14  URL              2840 non-null   object 
dtypes: float64(1), object(14)
memory usage: 332.9+ KB

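Rating is the only numeric column, and it is populated for just 252 of the 2,840 rows. Everything else is stored as text, so there is not much numeric data to work with until the other columns are parsed.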

Preprocessing¶

In [5]:
# show data types that are non-numeric
df.select_dtypes("object").columns
Out[5]:
Index(['Title', 'Author', 'Price', 'Limited Stock', 'Discount', 'Genre',
       'Number of Pages', 'Weight', 'ISBN', 'Language', 'Related Genres',
       'Subgenres', 'Synopsis', 'URL'],
      dtype='object')
In [6]:
na = df.isna().sum()
na[na > 0]
Out[6]:
Rating             2588
Limited Stock      1111
Discount           2799
Number of Pages     200
Subgenres           185
Synopsis              5
dtype: int64

Price on BooksMandala¶

Books on BooksMandala are often discounted. When the price is scraped, the entire price text is extracted, so a discounted entry contains the sale price, the original price and the discount percentage in a single string.

In [7]:
df["Price"].unique()
Out[7]:
array(['Rs. 720', 'Rs. 1200', 'Rs. 1592', 'Rs. 1040', 'Rs. 1118',
       'Rs. 1500', 'Rs. 2500', 'Rs. 2238', 'Rs. 1438', 'Rs. 700',
       'Rs. 695', 'Rs. 958', 'Rs. 880', 'Rs. 256', 'Rs. 2078', 'Rs. 800',
       'Rs. 142', 'Rs. 2560', 'Rs. 4800', 'Rs. 798', 'Rs. 960', 'Rs. 558',
       'Rs. 2800', 'Rs. 11360', 'Rs. 640', 'Rs. 1598', 'Rs. 300',
       'Rs. 4000', 'Rs. 3000', 'Rs. 478', 'Rs. 638', 'Rs. 398', 'Rs. 96',
       'Rs. 4480', 'Rs. 472', 'Rs. 318', 'Rs. 560', 'Rs. 3200',
       'Rs. 2000', 'Rs. 366', 'Rs. 200', 'Rs. 288', 'Rs. 2718', 'Rs. 392',
       'Rs. 5120', 'Rs. 650', 'Rs. 2160', 'Rs. 1360', 'Rs. 1278',
       'Rs. 760', 'Rs. 400', 'Rs. 176', 'Rs. 4649', 'Rs. 238', 'Rs. 125',
       'Rs. 1520', 'Rs. 1150', 'Rs. 3417', 'Rs. 2870', 'Rs. 70',
       'Rs. 999', 'Rs. 3198', 'Rs. 3358', 'Rs. 2225', 'Rs. 260',
       'Rs. 1995', 'Rs. 1917', 'Rs. 705', 'Rs. 1600', 'Rs. 4318',
       'Rs. 1918', 'Rs. 1758', 'Rs. 360', 'Rs. 280', 'Rs. 224', 'Rs. 375',
       'Rs. 2398', 'Rs.  358.5 Rs.  478( 25% OFF)', 'Rs. 840',
       'Rs.  240 Rs.  320( 25% OFF)', 'Rs.  780 Rs.  1040( 25% OFF)',
       'Rs. 399', 'Rs. 824', 'Rs. 552', 'Rs. 350', 'Rs. 499', 'Rs. 312',
       'Rs. 295', 'Rs.  558.6 Rs.  798( 30% OFF)',
       'Rs.  598.5 Rs.  798( 25% OFF)', 'Rs.  670.6 Rs.  958( 30% OFF)',
       'Rs.  504 Rs.  720( 30% OFF)', 'Rs.  196 Rs.  280( 30% OFF)',
       'Rs.  838.5 Rs.  1118( 25% OFF)', 'Rs. 2470', 'Rs. 455', 'Rs. 95',
       'Rs. 875', 'Rs. 1000', 'Rs.  382.8 Rs.  638( 40% OFF)', 'Rs. 6718',
       'Rs.  718.5 Rs.  958( 25% OFF)', 'Rs. 3518', 'Rs. 2558',
       'Rs.  218.4 Rs.  312( 30% OFF)', 'Rs. 500', 'Rs. 600', 'Rs. 666',
       'Rs. 1958', 'Rs.  280 Rs.  400( 30% OFF)',
       'Rs.  262.5 Rs.  350( 25% OFF)', 'Rs. 550', 'Rs. 450', 'Rs. 240',
       'Rs. 850', 'Rs. 100', 'Rs. 2295', 'Rs. 1425', 'Rs. 150', 'Rs. 250',
       'Rs. 1950', 'Rs. 750', 'Rs. 675', 'Rs. 60', 'Rs. 90', 'Rs. 380',
       'Rs. 525', 'Rs. 632', 'Rs. 1700', 'Rs. 545', 'Rs. 495', 'Rs. 50',
       'Rs. 160', 'Rs. 1750', 'Rs. 456', 'Rs. 507', 'Rs. 480', 'Rs. 645',
       'Rs. 190', 'Rs. 680', 'Rs. 1349', 'Rs. 608', 'Rs. 752', 'Rs. 85',
       'Rs. 110', 'Rs. 993', 'Rs. 816', 'Rs. 330', 'Rs. 792', 'Rs. 75',
       'Rs. 65', 'Rs. 660', 'Rs.  330.4 Rs.  472( 30% OFF)', 'Rs. 998',
       'Rs. 952', 'Rs. 580', 'Rs.  360 Rs.  480( 25% OFF)',
       'Rs.  105 Rs.  150( 30% OFF)', 'Rs. 158', 'Rs. 230', 'Rs. 78',
       'Rs. 130', 'Rs. 336', 'Rs. 112', 'Rs. 298', 'Rs. 275', 'Rs. 175',
       'Rs. 320', 'Rs. 195', 'Rs. 395', 'Rs. 325', 'Rs. 770', 'Rs. 25598',
       'Rs. 145', 'Rs. 3192', 'Rs. 140', 'Rs. 299', 'Rs. 1440', 'Rs. 105',
       'Rs. 490', 'Rs. 152', 'Rs. 115', 'Rs. 1112', 'Rs. 3652', 'Rs. 995',
       'Rs. 530', 'Rs. 704', 'Rs.  180 Rs.  240( 25% OFF)', 'Rs. 832',
       'Rs. 795', 'Rs. 72', 'Rs. 128', 'Rs. 225', 'Rs. 290', 'Rs. 2878',
       'Rs. 3038', 'Rs. 1344', 'Rs. 1760', 'Rs. 672', 'Rs. 1272',
       'Rs. 448', 'Rs. 1116', 'Rs.  446.6 Rs.  638( 30% OFF)', 'Rs. 1120',
       'Rs. 2240', 'Rs. 1568', 'Rs. 440', 'Rs. 1232', 'Rs. 896',
       'Rs. 1840', 'Rs. 950', 'Rs. 1680', 'Rs. 80', 'Rs. 22398',
       'Rs. 1250', 'Rs. 2320', 'Rs. 6398', 'Rs. 3280', 'Rs. 2640',
       'Rs. 19198', 'Rs. 6800', 'Rs. 304', 'Rs. 1960', 'Rs. 2880',
       'Rs. 625', 'Rs. 2544', 'Rs. 475', 'Rs. 4792', 'Rs. 120', 'Rs. 624',
       'Rs. 520', 'Rs. 104', 'Rs. 432', 'Rs. 170', 'Rs. 384', 'Rs. 374',
       'Rs. 768', 'Rs.  1008 Rs.  1440( 30% OFF)', 'Rs. 496', 'Rs. 216',
       'Rs. 1275', 'Rs. 1080', 'Rs. 1800', 'Rs. 592', 'Rs. 3400',
       'Rs. 1280', 'Rs. 1300', 'Rs. 340', 'Rs. 1007', 'Rs. 64', 'Rs. 787',
       'Rs. 1115', 'Rs. 1595', 'Rs. 900', 'Rs. 486', 'Rs. 1584',
       'Rs. 2072', 'Rs. 1920', 'Rs. 2726', 'Rs. 790', 'Rs. 944',
       'Rs. 598', 'Rs. 333', 'Rs. 555', 'Rs. 425', 'Rs. 498', 'Rs. 576',
       'Rs. 599', 'Rs. 595', 'Rs. 220', 'Rs. 775', 'Rs. 548', 'Rs. 575',
       'Rs. 348', 'Rs. 265', 'Rs. 698', 'Rs. 699', 'Rs. 458', 'Rs. 777',
       'Rs. 648', 'Rs. 748', 'Rs. 445', 'Rs. 485', 'Rs. 1904', 'Rs. 688',
       'Rs. 1142', 'Rs. 2549', 'Rs. 14260', 'Rs. 2224', 'Rs. 9792',
       'Rs. 1274', 'Rs. 6000', 'Rs. 1825', 'Rs. 1277', 'Rs. 2100',
       'Rs. 2400', 'Rs. 3600', 'Rs. 2545', 'Rs. 3998', 'Rs. 928',
       'Rs. 9918', 'Rs. 1593', 'Rs.  600 Rs.  800( 25% OFF)',
       'Rs.  489.3 Rs.  699( 30% OFF)', 'Rs. 206', 'Rs. 1208', 'Rs. 4134',
       'Rs. 5278', 'Rs. 2480', 'Rs. 2141', 'Rs. 5118', 'Rs. 4024',
       'Rs. 1180', 'Rs. 3500', 'Rs. 3388', 'Rs.  477 Rs.  795( 40% OFF)',
       'Rs. 3437', 'Rs. 1429', 'Rs. 2514', 'Rs. 2250', 'Rs. 742',
       'Rs. 1550', 'Rs. 590', 'Rs. 1939'], dtype=object)

The Price column also contains values like Rs. 358.5 Rs. 478( 25% OFF), where the sale price comes first, then the original price, then the percentage. We can use RegEx to match each amount in the string.

In [ ]:
def extract_price(price: str) -> dict:
    # non-capturing group, so findall returns plain strings such as ['358.5', '478']
    matches = re.findall(r'Rs\.\s+(\d+(?:\.\d+)?)', price)

    if len(matches) == 1:
        # only one price, treat it as the list price
        return {
            'discounted_price': 0,  # 0 marks "no discounted price"
            'list_price': float(matches[0])
        }
    if len(matches) >= 2:
        # two prices: the first is the discounted price, the second the list price
        return {
            'discounted_price': float(matches[0]),
            'list_price': float(matches[1])
        }
    return {'discounted_price': None, 'list_price': None}


print(extract_price("Rs.  358.5 Rs.  478( 25% OFF)"))
print(extract_price("Rs. 450"))
{'discounted_price': 358.5, 'list_price': 478.0}
{'discounted_price': 0, 'list_price': 450.0}
In [ ]:
prices = df["Price"].apply(extract_price)
df["Price"] = prices.apply(lambda x: x["discounted_price"]
                           if x["discounted_price"] != 0 else x["list_price"])
df["List Price"] = prices.apply(lambda x: x["list_price"])
df["Discount Amount"] = prices.apply(
    lambda x: 0 if x["discounted_price"] == 0 else x["list_price"] - x["discounted_price"])

df[["Price", "List Price", "Discount Amount"]]
Out[ ]:
       Price  List Price  Discount Amount
0      720.0       720.0              0.0
1     1200.0      1200.0              0.0
2     1592.0      1592.0              0.0
3     1040.0      1040.0              0.0
4     1118.0      1118.0              0.0
...      ...         ...              ...
2835   500.0       500.0              0.0
2836   798.0       798.0              0.0
2837   632.0       632.0              0.0
2838   880.0       880.0              0.0
2839   560.0       560.0              0.0

2840 rows × 3 columns
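
Since a discounted entry carries the sale price, the list price and the percentage in one string, the three values can be cross-checked against each other. The sketch below is an illustrative sanity check rather than part of the original notebook: parse_price_text, is_consistent and the 0.01 tolerance are assumptions, and the sample strings are copied from the unique values listed earlier.

```python
import re

PRICE_RE = re.compile(r"Rs\.\s+(\d+(?:\.\d+)?)")
PERCENT_RE = re.compile(r"\(\s*(\d+)%\s*OFF\)")


def parse_price_text(text: str) -> dict:
    """Hypothetical helper: sale price, list price and percent from one scraped string."""
    prices = [float(p) for p in PRICE_RE.findall(text)]
    if not prices:
        raise ValueError(f"no price found in {text!r}")
    percent = PERCENT_RE.search(text)
    if len(prices) >= 2 and percent:
        return {"price": prices[0], "list_price": prices[1],
                "percent": int(percent.group(1))}
    # no discount text: the single amount is both the price and the list price
    return {"price": prices[0], "list_price": prices[0], "percent": 0}


def is_consistent(parsed: dict, tol: float = 0.01) -> bool:
    """Check that list_price reduced by percent matches the scraped sale price."""
    expected = parsed["list_price"] * (1 - parsed["percent"] / 100)
    return abs(expected - parsed["price"]) <= tol


samples = ["Rs.  358.5 Rs.  478( 25% OFF)", "Rs.  477 Rs.  795( 40% OFF)", "Rs. 450"]
checks = [is_consistent(parse_price_text(s)) for s in samples]
print(checks)
```

Any row where the check fails would be worth inspecting by hand before trusting the derived Discount Amount.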

In [10]:
print(df["Number of Pages"])
print("Null count: ", df["Number of Pages"].isna().sum())
0        33 Pages
1             NaN
2       457 Pages
3       248 Pages
4       242 Pages
          ...    
2835    336 Pages
2836    112 Pages
2837    266 Pages
2838    312 Pages
2839    189 Pages
Name: Number of Pages, Length: 2840, dtype: object
Null count:  200

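None of the values has a decimal part, so we can simply remove every non-digit character (\D) with a RegEx and convert the result to an integer. Because the rows with missing values come back as NaN when the result is assigned to the column, the column ends up as float64.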

In [ ]:
df["Number of Pages"] = df["Number of Pages"][~df["Number of Pages"].isna()] \
    .str.replace(r"\D", "", regex=True).astype("int")

df["Number of Pages"]
Out[ ]:
0        33.0
1         NaN
2       457.0
3       248.0
4       242.0
        ...  
2835    336.0
2836    112.0
2837    266.0
2838    312.0
2839    189.0
Name: Number of Pages, Length: 2840, dtype: float64
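
The page counts are shown as floats because a NumPy integer column cannot hold NaN. If integer page counts are preferred, pandas' nullable Int64 dtype is one option. This is a minimal sketch on a toy Series whose values are copied from the output above; it is an alternative, not what the notebook does.

```python
import pandas as pd

raw_pages = pd.Series(["33 Pages", None, "457 Pages", "248 Pages"])

# pull out the run of digits, convert to numbers, then cast to the nullable
# integer dtype so that missing rows stay missing (<NA>) instead of forcing floats
pages = pd.to_numeric(raw_pages.str.extract(r"(\d+)", expand=False)).astype("Int64")
print(pages.dtype, pages.tolist())
```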
In [12]:
df["Weight"].unique()
Out[12]:
array(['196g', '1050g', '970g', '200g', '340g', '515g', '2400g', '370g',
       '1200g', '290g', '585g', '250g', '560g', '300g', '344g', '1260g',
       '525g', '86g', '2160g', '1550g', '490g', '640g', '339g', '520g',
       '180g', '550g', '260g', '30g', '700g', '675g', '1730g', '2960g',
       '335g', '415g', '1140g', '790g', '170g', '1320g', '1500g', '530g',
       '150g', '310g', '569g', '130g', '100g', '1400g', '1160g', '80g',
       '800g', '360g', '825g', '500g', '600g', '280g', '140g', '110g',
       '135g', '910g', '1375g', '275g', '660g', '642g', '210g', '206g',
       '450g', '75g', '1340g', '161g', '940g', '380g', '460g', '477g',
       '220g', '197g', '615g', '565g', '315g', '690g', '1323g', '225g',
       '70g', '230g', '2475g', '555g', '830g', '1640g', '195g', '215g',
       '390g', '305g', '960g', '650g', '540g', '720g', '715g', '405g',
       '1860g', '915g', '725g', '843g', '175g', '365g', '249g', '625g',
       '235g', '245g', '134g', '325g', '505g', '410g', '285g', '648g',
       '425g', '400g', '181g', '350g', '454g', '320g', '545g', '440g',
       '190g', '375g', '1180g', '136g', '160g', '205g', '470g', '145g',
       '865g', '240g', '465g', '192g', '345g', '219g', '730g', '270g',
       '420g', '480g', '330g', '355g', '395g', '295g', '95g', '265g',
       '512g', '820g', '635g', '735g', '90g', '430g', '165g', '255g',
       '269g', '185g', '155g', '670g', '337g', '294g', '166g', '232g',
       '189g', '277g', '72g', '348g', '301g', '2800g', '1425g', '870g',
       '385g', '475g', '252g', '575g', '91g', '710g', '1075g', '630g',
       '1680g', '935g', '177g', '378g', '203g', '2000g', '535g', '610g',
       '510g', '216g', '590g', '317g', '120g', '99g', '142g', '495g',
       '1030g', '580g', '164g', '169g', '605g', '1100g', '422g', '807g',
       '336g', '890g', '187g', '85g', '34g', '188g', '125g', '105g',
       '146g', '7900g', '122g', '920g', '147g', '810g', '162g', '620g',
       '65g', '60g', '228g', '876g', '50g', '455g', '238g', '123g',
       '323g', '93g', '248g', '894g', '213g', '1130g', '765g', '261g',
       '514g', '850g', '322g', '257g', '94g', '595g', '491g', '1790g',
       '115g', '264g', '1175g', '506g', '1010g', '1300g', '1810g', '944g',
       '174g', '771g', '885g', '805g', '382g', '1440g', '204g', '272g',
       '1430g', '1275g', '570g', '1073g', '312g', '1335g', '127g', '657g',
       '435g', '372g', '246g', '499g', '1250g', '746g', '4940g', '211g',
       '369g', '191g', '397g', '376g', '4000g', '222g', '1350g', '900g',
       '55g', '566g', '445g', '131g', '945g', '1190g', '1090g', '1080g',
       '760g', '2050g', '318g', '8000g', '263g', '1082g', '352g', '242g',
       '548g', '167g', '485g', '1070g', '1465g', '1485g', '780g', '1185g',
       '680g', '1040g', '346g', '1460g', '214g', '1490g', '293g', '227g',
       '579g', '52g', '62g', '750g', '401g', '386g', '429g', '1330g',
       '840g', '1000g', '3000g', '1150g', '1299g', '975g', '1g', '2425g',
       '2150g', '1055g', '1950g', '1820g', '1620g', '880g', '860g',
       '1760g', '770g', '1420g', '1530g', '1870g', '432g', '1560g',
       '1590g', '786g', '785g', '159g', '132g', '999g', '1165g', '226g',
       '816g', '2250g', '2930g', '67g', '144g', '243g', '685g', '645g',
       '439g', '463g', '302g', '296g', '407g', '359g', '740g', '202g',
       '459g', '1060g', '736g', '995g', '1280g', '835g', '286g', '695g'],
      dtype=object)
In [13]:
df["Weight"] = df["Weight"].str.replace("g", "", regex=True).astype("int")
In [14]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2840 entries, 0 to 2839
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Title            2840 non-null   object 
 1   Author           2840 non-null   object 
 2   Price            2840 non-null   float64
 3   Rating           252 non-null    float64
 4   Limited Stock    1729 non-null   object 
 5   Discount         41 non-null     object 
 6   Genre            2840 non-null   object 
 7   Number of Pages  2640 non-null   float64
 8   Weight           2840 non-null   int64  
 9   ISBN             2840 non-null   object 
 10  Language         2840 non-null   object 
 11  Related Genres   2840 non-null   object 
 12  Subgenres        2655 non-null   object 
 13  Synopsis         2835 non-null   object 
 14  URL              2840 non-null   object 
 15  List Price       2840 non-null   float64
 16  Discount Amount  2840 non-null   float64
dtypes: float64(5), int64(1), object(11)
memory usage: 377.3+ KB
In [15]:
df["Discount"].value_counts()
Out[15]:
Discount
( 25% OFF)    19
( 30% OFF)    18
( 40% OFF)     4
Name: count, dtype: int64
In [ ]:
df["Discount"] = df["Discount"].str.replace(
    r"\D", "", regex=True).astype("float")
df.fillna({"Discount": 0}, inplace=True)
df["Discount"] = df["Discount"].apply(lambda x: x / 100)

df["Discount"].value_counts()
Out[ ]:
Discount
0.00    2799
0.25      19
0.30      18
0.40       4
Name: count, dtype: int64

Limited Stocks¶

The Limited Stock text is only present when a book's stock is limited, so a missing value means no limited-stock notice was scraped. It is therefore safe to convert this column to a boolean flag.

In [17]:
df["Limited Stock"]
Out[17]:
0       Only 3 item left in stock!
1                              NaN
2                              NaN
3       Only 6 item left in stock!
4                              NaN
                   ...            
2835                           NaN
2836    Only 4 item left in stock!
2837    Only 5 item left in stock!
2838                           NaN
2839                           NaN
Name: Limited Stock, Length: 2840, dtype: object
In [ ]:
df["Limited Stock"] = df["Limited Stock"].notna()
df["Limited Stock"].value_counts()
Out[ ]:
Limited Stock
True     1729
False    1111
Name: count, dtype: int64
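
The scraped message also states how many copies remain, which a boolean flag discards. If that count turned out to be useful later, it could be kept in a separate column. A sketch on sample strings taken from the output above; the Is Limited and Items Left column names are hypothetical.

```python
import pandas as pd

stock = pd.DataFrame({"Limited Stock": ["Only 3 item left in stock!", None,
                                        "Only 6 item left in stock!"]})

# flag: a non-null message means a limited-stock notice was scraped
stock["Is Limited"] = stock["Limited Stock"].notna()

# count: first run of digits in the message, <NA> where there was no message
stock["Items Left"] = pd.to_numeric(
    stock["Limited Stock"].str.extract(r"(\d+)", expand=False)).astype("Int64")
print(stock[["Is Limited", "Items Left"]])
```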

Author¶

In [19]:
df["Author"]
Out[19]:
0                                      by Julia Donaldson
1                                        by Michel Peisel
2       by Dalai Lama Xiv Bstan-ʼDzin-Rgya-Mtsho and J...
3                   by Barry Green and W. Timothy Gallwey
4                             by Nir Eyal and Ryan Hoover
                              ...                        
2835                                 by Bishnu Raj Upreti
2836                                      by Jack Kerouac
2837                                         by John Wood
2838                                 by Peter Matthiessen
2839                                     by Amar. Bhushan
Name: Author, Length: 2840, dtype: object
In [20]:
# strip only the leading "by"; a plain replace would also hit names such as "Abby"
df["Author"] = df["Author"].str.replace(r"^\s*by\s+", "", regex=True)
df["Author"] = df["Author"].str.replace(" and ", ", ")
df["Author"] = df["Author"].str.strip()
In [21]:
df["Author"]
Out[21]:
0                                         Julia Donaldson
1                                           Michel Peisel
2       Dalai Lama Xiv Bstan-ʼDzin-Rgya-Mtsho, John Sn...
3                         Barry Green, W. Timothy Gallwey
4                                   Nir Eyal, Ryan Hoover
                              ...                        
2835                                    Bishnu Raj Upreti
2836                                         Jack Kerouac
2837                                            John Wood
2838                                    Peter Matthiessen
2839                                        Amar. Bhushan
Name: Author, Length: 2840, dtype: object
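
One thing to watch when cleaning this column: removing the substring "by" everywhere would also delete those two letters from inside a name. The sketch below anchors the pattern to the start of the string and, as an optional extra step, splits co-authors into a list. The first sample is taken from the data above; "Abby Rushby" is a made-up name used only to show the pitfall.

```python
import pandas as pd

authors = pd.Series(["by Barry Green and W. Timothy Gallwey", "by Abby Rushby"])

# what an unanchored replace does to a name that contains "by"
naive = authors.str.replace("by", "", regex=False).str.strip()

cleaned = (
    authors
    .str.replace(r"^\s*by\s+", "", regex=True)   # remove only the leading "by"
    .str.split(r"\s+and\s+", regex=True)         # one list entry per co-author
)
print(naive.tolist())
print(cleaned.tolist())
```

Keeping authors as a list is a design choice for later per-author counting; joining with ", " instead reproduces the string format used in the notebook.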

Genres on BooksMandala¶

Preprocessing the Related Genres column¶

A book on BooksMandala is not limited to a single genre.

The Related Genres section on a book's webpage lists the genres the book belongs to. In the scraped text each list is repeated and some entries carry a trailing newline, so the values need to be cleaned and de-duplicated.

In [22]:
df["Related Genres"].value_counts()
Out[22]:
Related Genres
Foreign Languages, Foreign Languages                                                                                                                                              122
Nepali, Nepali                                                                                                                                                                    105
Miscellaneous\n, Miscellaneous                                                                                                                                                    102
Arts and Photography, Arts and Photography                                                                                                                                         52
Kids and Teens, Kids and Teens                                                                                                                                                     51
                                                                                                                                                                                 ... 
Spirituality and Philosophy\n, Nature\n, Spirituality and Philosophy, Nature                                                                                                        1
Spirituality and Philosophy, History, Biography, and Social Science, Nature, Spirituality and Philosophy, History, Biography, and Social Science, Nature                            1
Nature, Nepali, Nature, Nepali                                                                                                                                                      1
Fiction and Literature, Fiction and Literature, Fiction and Literature, Nature, Fiction and Literature, Fiction and Literature, Fiction and Literature, Fiction and Literature      1
Travel, History, Biography, and Social Science, Travel, History, Biography, and Social Science                                                                                      1
Name: count, Length: 613, dtype: int64
In [23]:
df["Genre"].unique()
Out[23]:
array(['Arts And Photography', 'Business And Investing',
       'Fiction And Literature', 'Foreign Languages',
       'History Biography And Social Science', 'Kids And Teens',
       'Learning And Reference', 'Lifestyle And Wellness',
       'Manga And Graphic Novels', 'Miscellaneous', 'Nature', 'Nepali',
       'Political Science', 'Rare Coffee Table Books', 'Religion',
       'Self Improvement And Relationships',
       'Spirituality And Philosophy', 'Technology', 'Travel'],
      dtype=object)
In [ ]:
def preprocess_related_genres(related_genres: str) -> list[str]:
    unique_genres = list(df["Genre"].unique())
    genres = re.sub(r"History, Biography, and Social Science",
                    "History Biography And Social Science", related_genres).strip()

    genres = [genre.strip().title() for genre in genres.split(",")]
    extracted_genres = []
    for genre in genres:
        if genre in unique_genres and genre not in extracted_genres:
            extracted_genres.append(genre)

    return extracted_genres


print(df["Related Genres"][5])
print(preprocess_related_genres(df["Related Genres"][5]))
print(df["Related Genres"][1845])
print(preprocess_related_genres(df["Related Genres"][1845]))
Arts and Photography, Nepali, Arts and Photography, Nepali
['Arts And Photography', 'Nepali']
History, Biography, and Social Science, Nepali, History, Biography, and Social Science, Nepali
['History Biography And Social Science', 'Nepali']
In [25]:
df["Related Genres"] = df["Related Genres"].apply(preprocess_related_genres)
df[["Genre", "Related Genres"]]
Out[25]:
                     Genre                                     Related Genres
0     Arts And Photography             [Kids And Teens, Arts And Photography]
1     Arts And Photography              [Arts And Photography, Miscellaneous]
2     Arts And Photography                     [Travel, Arts And Photography]
3     Arts And Photography  [Arts And Photography, Self Improvement And Re...
4     Arts And Photography  [Business And Investing, Arts And Photography,...
...                    ...                                                ...
2835                Travel                                   [Travel, Nepali]
2836                Travel                           [Fiction And Literature]
2837                Travel  [History Biography And Social Science, Busines...
2838                Travel             [History Biography And Social Science]
2839                Travel     [Nepali, History Biography And Social Science]

2840 rows × 2 columns

In [26]:
df["Related Genres"].value_counts()
Out[26]:
Related Genres
[Nepali]                                                                                   139
[Fiction And Literature]                                                                   138
[Foreign Languages]                                                                        122
[Miscellaneous]                                                                            119
[Kids And Teens]                                                                           103
                                                                                          ... 
[Nature, Nepali]                                                                             1
[Spirituality And Philosophy, History Biography And Social Science, Nature]                  1
[Spirituality And Philosophy, Nature]                                                        1
[Self Improvement And Relationships, Spirituality And Philosophy, Arts And Photography]      1
[Fiction And Literature, Learning And Reference, History Biography And Social Science]       1
Name: count, Length: 321, dtype: int64
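
With every book now carrying a list of genres, the column can be turned into a multi-hot matrix for modeling. The notebook imports MultiLabelBinarizer, so the sketch below assumes that is the intended tool; the three toy rows are illustrative and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

related = pd.Series([
    ["Kids And Teens", "Arts And Photography"],
    ["Travel", "Nepali"],
    ["Nepali"],
])

mlb = MultiLabelBinarizer()
# one column per distinct genre, 1 where the book lists that genre
genre_matrix = pd.DataFrame(mlb.fit_transform(related),
                            columns=mlb.classes_, index=related.index)
print(genre_matrix)
```

The columns follow the sorted order of mlb.classes_, so the same fitted binarizer should be reused if new books are encoded later.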

Sub-genres¶

In [ ]:
def preprocess_subgenres(genre_string: str) -> list[str]:
    genre_string_list = re.sub(r"\n", "", genre_string).strip().split(",")
    subgenres = [subgenre.strip() for subgenre in genre_string_list]
    extracted = []

    for subgenre in subgenres:
        if subgenre not in extracted:
            extracted.append(subgenre)

    return extracted


print(preprocess_subgenres(df["Subgenres"][5]))
print(preprocess_subgenres(df["Subgenres"][1455]))
['Picture Books', 'Books on Nepal']
['Books on India', 'Politics', 'History']
In [ ]:
df["Subgenres"] = df["Subgenres"][~df["Subgenres"].isna()] \
    .apply(preprocess_subgenres)
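
The loop in preprocess_subgenres is an order-preserving de-duplication. The same idea can be written with dict.fromkeys, because dictionaries keep insertion order in Python 3.7+. This is a sketch with a hypothetical name, dedupe_subgenres; the sample string only imitates the shape of the scraped values and is not a row from the dataset. Unlike the original function, it also drops empty entries.

```python
def dedupe_subgenres(genre_string: str) -> list[str]:
    """Hypothetical compact variant: split on commas, trim, keep first occurrences."""
    parts = (part.strip() for part in genre_string.split(","))
    # dict.fromkeys keeps the first occurrence of each key, in order
    return list(dict.fromkeys(part for part in parts if part))


sample = "Music\n, Self Help\n, Psychology\n, Music, Self Help, Psychology"
print(dedupe_subgenres(sample))
```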

Genre Accuracy¶

Every book on BooksMandala can belong to several genres. The data was scraped genre by genre, so the Genre column records the category a book was found under, which may not be the book's core genre. Each book's own page has a "Related Genres" section that lists its genres and describes the book more closely.

Take the book Big Magic by Elizabeth Gilbert as an example. Because the data was collected by browsing the genre categories on the BooksMandala website, Big Magic ended up under Arts and Photography, which isn't false. However, sites such as Goodreads (a website for book readers and recommendations) put Big Magic under Self Help.

In [29]:
df[df["Title"] == "Big Magic"]
Out[29]:
Title Author Price Rating Limited Stock Discount Genre Number of Pages Weight ISBN Language Related Genres Subgenres Synopsis URL List Price Discount Amount
155 Big Magic Elizabeth Gilbert 798.0 NaN True 0.0 Arts And Photography 271.0 215 9781408886182 English [Self Improvement And Relationships, Arts And ... [Self Help, Art, Psychology, Memoir] Readers of all ages and walks of life have dra... https://booksmandala.com/books/big-magic-15312 798.0 0.0
In [30]:
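# use the first related genre; fall back to the row's own scraped genre if the list is empty
df["Genre"] = [related[0] if related else genre
               for related, genre in zip(df["Related Genres"], df["Genre"])]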
In [31]:
df[df["Title"] == "Big Magic"]
Out[31]:
Title Author Price Rating Limited Stock Discount Genre Number of Pages Weight ISBN Language Related Genres Subgenres Synopsis URL List Price Discount Amount
155 Big Magic Elizabeth Gilbert 798.0 NaN True 0.0 Self Improvement And Relationships 271.0 215 9781408886182 English [Self Improvement And Relationships, Arts And ... [Self Help, Art, Psychology, Memoir] Readers of all ages and walks of life have dra... https://booksmandala.com/books/big-magic-15312 798.0 0.0

Taking the first genre listed on the book's own webpage gives a categorization closer to the book's core genre: Big Magic moves from Arts And Photography to Self Improvement And Relationships.

See: Big Magic on BooksMandala
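
When a book's Related Genres list is empty, the fallback should be that row's own scraped genre, not the whole Genre column. A minimal sketch of a row-wise fallback on toy data; the first row mirrors Big Magic and the second row is an invented edge case.

```python
import pandas as pd

books = pd.DataFrame({
    "Genre": ["Arts And Photography", "Travel"],
    "Related Genres": [["Self Improvement And Relationships", "Arts And Photography"], []],
})

# first related genre if there is one, otherwise keep the scraped genre of that row
books["Genre"] = [
    related[0] if related else genre
    for related, genre in zip(books["Related Genres"], books["Genre"])
]
print(books["Genre"].tolist())
```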

Duplicates¶

In [33]:
df.drop_duplicates(subset=['Title', 'ISBN'], keep='last', inplace=True)
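
Since the data was collected genre by genre, the same book can presumably show up under more than one category. Before dropping rows it can help to see how many are affected. A sketch on toy rows; the titles and ISBNs are placeholders.

```python
import pandas as pd

books = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book A"],
    "ISBN": ["111", "222", "111"],
    "Genre": ["Travel", "Nepali", "Nature"],
})

# keep=False marks every member of a duplicated (Title, ISBN) group
dupes = books[books.duplicated(subset=["Title", "ISBN"], keep=False)]
print(len(dupes), "rows share a Title/ISBN pair")

# keep="last" retains the final occurrence of each pair
deduped = books.drop_duplicates(subset=["Title", "ISBN"], keep="last")
print(deduped)
```

Note that drop_duplicates leaves gaps in the index; reset_index(drop=True) can be applied afterwards if positional access is needed.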

Null values¶

In [34]:
na = df.isna().sum()
na[na > 0]
Out[34]:
Rating             2087
Number of Pages     188
Synopsis              5
dtype: int64
In [35]:
(df["Number of Pages"].isna().sum() / df.shape[0]) * 100
Out[35]:
8.355555555555554
In [36]:
(df["Rating"].isna().sum() / df.shape[0]) * 100
Out[36]:
92.75555555555556
In [37]:
df.drop("Rating", axis=1, inplace=True)
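
The per-column percentages can also be computed for all columns at once, because the mean of a boolean mask is the fraction of True values. A sketch on toy data with made-up missingness:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Rating": [5.0, np.nan, np.nan, np.nan],
    "Number of Pages": [33.0, np.nan, 457.0, 248.0],
    "Weight": [196, 1050, 970, 200],
})

# share of missing values per column, as a percentage, largest first
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct[missing_pct > 0])
```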

Synopsis¶

In [ ]:
def clean_synopses(text: str) -> str | None:
    default_patterns = [
        r"A description of this book has not been provided",
        r"is available for purchase at Books Mandala",
        r"No description available",
        r"^\s*\d+\s*$"  # detects only numbers and whitespaces
    ]

    for pattern in default_patterns:
        if re.search(pattern, str(text), re.IGNORECASE):
            return None

    return text


df["Synopsis"] = df["Synopsis"].apply(clean_synopses)

Instead of dropping the rows with a missing synopsis right away, I first try to fetch the synopsis from the Google Books API using the book's ISBN.

In [ ]:
import requests
from dotenv import load_dotenv
import os

load_dotenv()


def get_synopsis(isbn: str) -> str | None:
    # requires your own Google Books API key in the API_KEY environment variable
    response = requests.get(
        "https://www.googleapis.com/books/v1/volumes",
        params={"q": f"isbn:{isbn}", "key": os.getenv("API_KEY")},
        timeout=10)
    data = response.json()

    # no match for this ISBN
    if not data.get("items"):
        return None

    # follow the first result's selfLink to fetch the volume record
    self_link = data["items"][0].get("selfLink")
    if not self_link:
        return None

    new_data = requests.get(self_link, timeout=10).json()
    return new_data.get("volumeInfo", {}).get("description")


def fill_synopsis(row):
    if not pd.isna(row["Synopsis"]):
        return row["Synopsis"]

    return get_synopsis(row["ISBN"])


df[df["Synopsis"].isna()]["ISBN"]
Out[ ]:
11       9789993347972
15      BM35556B95F6BE
27       9789386671769
28       9789937050098
37       9788177696479
             ...      
2776     9781838952730
2778    BM13596E34E338
2812     9789815204681
2817    BMD38A20C2904F
2839     9789353570132
Name: ISBN, Length: 321, dtype: object
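
Some values in the ISBN column, such as BM35556B95F6BE, do not have the shape of an ISBN, so an isbn: query for them is unlikely to match anything. One option is to validate the identifier before calling the API. The sketch below checks the standard ISBN-13 check digit; is_isbn13 is a hypothetical helper, not part of the notebook.

```python
def is_isbn13(value: str) -> bool:
    """Return True if value is 13 digits with a valid ISBN-13 check digit."""
    digits = value.replace("-", "").strip()
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # weights alternate 1, 3 over the first 12 digits
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == int(digits[12])


for candidate in ["9781509804757", "BM35556B95F6BE"]:
    print(candidate, is_isbn13(candidate))
```

Rows that fail the check could be skipped in fill_synopsis to save API calls, at the cost of never retrying identifiers that were merely mistyped.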
In [40]:
df['Synopsis'] = df.apply(fill_synopsis, axis=1)
In [41]:
df["Synopsis"].isna().sum()
Out[41]:
308
In [42]:
df[df["Synopsis"].isna()]
Out[42]:
Title Author Price Limited Stock Discount Genre Number of Pages Weight ISBN Language Related Genres Subgenres Synopsis URL List Price Discount Amount
11 pokhara y el annapurna Dinesh. Shrestha 695.0 False 0.0 Arts And Photography NaN 250 9789993347972 English [Arts And Photography] [Photography and Filmmaking] None https://booksmandala.com/books/pokhara-y-el-an... 695.0 0.0
15 Tibetan Children's Colouring Book Unknown 256.0 False 0.0 Kids And Teens 16.0 344 BM35556B95F6BE English [Kids And Teens, Arts And Photography] [Coloring for Children, Coloring Books] None https://booksmandala.com/books/tibetan-childre... 256.0 0.0
27 Solimo Copy Colour Pack, Set of 6 Books Unassigned 960.0 False 0.0 Arts And Photography NaN 550 9789386671769 English [Arts And Photography] [Coloring Books] None https://booksmandala.com/books/solimo-copy-col... 960.0 0.0
28 Color Nepal coloring book for all Laibari 700.0 True 0.0 Arts And Photography NaN 260 9789937050098 English [Arts And Photography] [Coloring Books] None https://booksmandala.com/books/color-nepal-col... 700.0 0.0
37 Mandala Colouring Book Unassigned 300.0 True 0.0 Arts And Photography NaN 250 9788177696479 English [Arts And Photography] [Art] None https://booksmandala.com/books/mandala-colouri... 300.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2774 The World Pocket Atlas Unassigned 742.0 True 0.0 Travel NaN 250 9788182525160 English [Travel] [Atlas] None https://booksmandala.com/books/the-world-pocke... 742.0 0.0
2776 Everest 1922 Mick Conefrey 798.0 True 0.0 Travel 310.0 245 9781838952730 English [Travel] [Climbing and Mountaineering] None https://booksmandala.com/books/everest-1922-49955 798.0 0.0
2778 Destination Nepal Rabindra Dhoju 110.0 False 0.0 Travel 24.0 60 BM13596E34E338 English [Travel] [Travel Guide Books] None https://booksmandala.com/books/destination-nep... 110.0 0.0
2817 The Pokhara Valley Rabinthara Dhoju 110.0 False 0.0 Travel 24.0 100 BMD38A20C2904F English [Travel] [Travel Guide Books] None https://booksmandala.com/books/the-pokhara-val... 110.0 0.0
2839 INSIDE NEPAL/THE WALK-IN. Amar. Bhushan 560.0 False 0.0 Nepali 189.0 175 9789353570132 English [Nepali, History Biography And Social Science] [Books on Nepal, History] None https://booksmandala.com/books/inside-nepalthe... 560.0 0.0

308 rows × 16 columns

In [43]:
df.dropna(subset=["Synopsis"], inplace=True)
In [44]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1942 entries, 3 to 2838
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Title            1942 non-null   object 
 1   Author           1942 non-null   object 
 2   Price            1942 non-null   float64
 3   Limited Stock    1942 non-null   bool   
 4   Discount         1942 non-null   float64
 5   Genre            1942 non-null   object 
 6   Number of Pages  1888 non-null   float64
 7   Weight           1942 non-null   int64  
 8   ISBN             1942 non-null   object 
 9   Language         1942 non-null   object 
 10  Related Genres   1942 non-null   object 
 11  Subgenres        1942 non-null   object 
 12  Synopsis         1942 non-null   object 
 13  URL              1942 non-null   object 
 14  List Price       1942 non-null   float64
 15  Discount Amount  1942 non-null   float64
dtypes: bool(1), float64(5), int64(1), object(9)
memory usage: 244.6+ KB

Number of Pages¶

Visualizing the distribution of Number of Pages before imputing the missing values

In [ ]:
fig = px.box(df,
             x="Number of Pages",
             title="Distribution of Number of Pages")
fig.update_layout(bargap=0.1)
fig.show()
In [ ]:
df.fillna({"Number of Pages": df["Number of Pages"].median()}, inplace=True)
In [ ]:
fig = px.box(df,
             x="Number of Pages",
             title="Distribution of Number of Pages")
fig.update_layout(bargap=0.1)
fig.show()

Export cleaned data¶

In [ ]:
df = df[['Title', 'Author', 'Price', 'List Price',
         'Discount Amount', 'Limited Stock', 'Discount',
         'Genre', 'Number of Pages', 'Weight', 'ISBN', 'Language',
         'Related Genres', 'Subgenres', 'Synopsis', 'URL']].copy()
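# NOTE: rename() without columns= targets index labels, so this call leaves the
# "Weight" column name unchanged; later cells keep using "Weight"
df.rename({"Weight": "Weight(grams)"}, inplace=True)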
df.to_csv("data/dataset_cleaned.csv", index=False)

Import cleaned data¶

In [2]:
df = pd.read_csv(
    "/home/am/booksmandala-data-analytics/notebooks/data/dataset_cleaned.csv")
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1942 entries, 0 to 1941
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Title            1942 non-null   object 
 1   Author           1939 non-null   object 
 2   Price            1942 non-null   float64
 3   List Price       1942 non-null   float64
 4   Discount Amount  1942 non-null   float64
 5   Limited Stock    1942 non-null   bool   
 6   Discount         1942 non-null   float64
 7   Genre            1942 non-null   object 
 8   Number of Pages  1942 non-null   float64
 9   Weight           1942 non-null   int64  
 10  ISBN             1942 non-null   object 
 11  Language         1942 non-null   object 
 12  Related Genres   1942 non-null   object 
 13  Subgenres        1942 non-null   object 
 14  Synopsis         1942 non-null   object 
 15  URL              1942 non-null   object 
dtypes: bool(1), float64(5), int64(1), object(9)
memory usage: 229.6+ KB

Clean and convert¶

In [4]:
df["Author"].isna().sum()
Out[4]:
3
In [5]:
df.dropna(subset=["Author"], inplace=True)
In [6]:
def clean_and_convert_to_list(value):
    if isinstance(value, str):
        value = value.strip()
        if value.startswith('[') and value.endswith(']'):
            try:
                return ast.literal_eval(value)
            except (ValueError, SyntaxError):
                return []
    elif isinstance(value, list):
        return value

    return []


df['Related Genres'] = df['Related Genres'].apply(clean_and_convert_to_list)
df['Subgenres'] = df['Subgenres'].apply(clean_and_convert_to_list)
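
The conversion above is needed because list-valued columns are written to CSV as their string representation. A minimal sketch of that round trip, using a made-up `toy` frame (not part of the dataset) and an in-memory buffer in place of a file:

```python
import ast
import io

import pandas as pd

# Hypothetical toy frame with a list-valued column
toy = pd.DataFrame({"Subgenres": [["Humor", "Ages 9 to 12"], ["Classics"]]})

# Round-trip through CSV: the lists come back as plain strings
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
first_raw = reloaded["Subgenres"].iloc[0]  # a str, not a list

# ast.literal_eval parses the string back into a Python list
parsed = reloaded["Subgenres"].apply(ast.literal_eval)
print(type(first_raw).__name__, parsed.iloc[0])
```

`ast.literal_eval` only accepts Python literals, which is why it is a reasonable choice here over `eval`.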

Exploratory Analysis and Visualizations¶

In [7]:
df.describe()
Out[7]:
Price List Price Discount Amount Discount Number of Pages Weight
count 1939.000000 1939.000000 1939.000000 1939.000000 1939.000000 1939.000000
mean 1034.083703 1037.166065 3.082362 0.004513 286.327488 385.328520
std 1217.404008 1216.543176 26.548347 0.035758 195.250736 421.485496
min 60.000000 60.000000 0.000000 0.000000 10.000000 1.000000
25% 560.000000 560.000000 0.000000 0.000000 192.000000 210.000000
50% 800.000000 800.000000 0.000000 0.000000 260.000000 285.000000
75% 1118.000000 1118.000000 0.000000 0.000000 347.000000 412.500000
max 25598.000000 25598.000000 432.000000 0.400000 3766.000000 8000.000000
In [8]:
avg_data = df.groupby('Genre').agg(
    {
        'Price': 'mean',
        'Number of Pages': 'mean',
        'ISBN': 'count'
    }
).reset_index()
avg_data.columns = ['Genre', 'Average Price',
                    'Average Page Count', 'Number of Books']

avg_data
Out[8]:
Genre Average Price Average Page Count Number of Books
0 Arts And Photography 1983.400000 191.708333 120
1 Business And Investing 1045.455189 313.669811 212
2 Fiction And Literature 896.280398 320.732955 352
3 Foreign Languages 640.125000 304.281250 32
4 History Biography And Social Science 1090.775494 348.316206 253
5 Kids And Teens 875.896000 131.320000 125
6 Learning And Reference 1138.465517 368.965517 58
7 Lifestyle And Wellness 1028.132353 304.911765 68
8 Manga And Graphic Novels 2138.722892 317.277108 83
9 Miscellaneous 615.397727 285.034091 88
10 Nature 996.758621 252.913793 58
11 Nepali 594.190083 252.991736 121
12 Rare Coffee Table Books 2233.500000 86.000000 2
13 Religion 778.200000 267.416667 60
14 Self Improvement And Relationships 846.029697 264.133333 165
15 Spirituality And Philosophy 775.662338 275.220779 77
16 Technology 1202.320000 267.720000 25
17 Travel 1153.900000 299.900000 40
In [9]:
Q1 = df['Price'].quantile(0.25)
Q3 = df['Price'].quantile(0.75)
IQR = Q3 - Q1

# keep only rows whose Price lies within 1.5 * IQR of Q1 and Q3
no_outlier_df = df[(df['Price'] >= (Q1 - 1.5 * IQR)) & (df['Price'] <= (Q3 + 1.5 * IQR))]
In [10]:
print("Before removing outliers: ", df.shape)
print("After removing outliers:", no_outlier_df.shape)
Before removing outliers:  (1939, 16)
After removing outliers: (1789, 16)
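
As a quick illustration of the IQR rule used above, here is a sketch on a made-up price series (the values are hypothetical, chosen so the bounds are easy to check by hand):

```python
import pandas as pd

# Hypothetical prices: five typical values and one extreme one
prices = pd.Series([500, 600, 700, 800, 900, 25000], name="Price")

q1 = prices.quantile(0.25)   # 625.0
q3 = prices.quantile(0.75)   # 875.0
iqr = q3 - q1                # 250.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 250.0, 1250.0

# between() is inclusive on both ends, like a >= / <= filter
inside = prices.between(lower, upper)
kept = prices[inside]
dropped = prices[~inside]
print(kept.tolist(), dropped.tolist())
```

Only the 25000 entry falls outside the fences, so it is the single value dropped.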

Histograms¶

In [11]:
fig = px.histogram(df,
                   x="Price",
                   marginal="box",
                   title="Distribution of Book Prices")
fig.update_layout(bargap=0.1)
fig.show()

fig = px.histogram(no_outlier_df,
                   x="Price",
                   marginal="box",
                   title="Distribution of Book Prices (after removing outliers)")
fig.update_layout(bargap=0.1)
fig.show()
In [12]:
fig = px.histogram(df,
                   x="Weight",
                   marginal="box",
                   title="Distribution of Book Weight")
fig.update_layout(bargap=0.1)
fig.show()
In [13]:
fig = px.histogram(avg_data,
                y='Average Price',
                x='Genre',
                title="Average Price by Genre",
                log_y=True)
fig.add_trace(go.Scatter(x=avg_data['Genre'], 
                         y=avg_data['Average Price'], 
                         mode='lines', 
                         name='Average Price Trend',
                         line=dict(color='DarkSlateGrey', width=1)))
fig.update_layout(height=500, showlegend=False, bargap=0.1)
fig.show()

Bar charts¶

In [14]:
fig = px.bar(df['Genre'].value_counts().reset_index(),
             x='Genre',
             y='count',
             title='Number of Books by Genre')
fig.update_layout(height=500, bargap=0.1)
# fig.add_trace(go.Scatter(x=df['Genre'].value_counts().reset_index()['Genre'], 
#                          y=df['Genre'].value_counts().reset_index()['count'], 
#                          mode='lines', 
#                          name='x',
#                          line=dict(color='DarkSlateGrey', width=1)))
fig.show()
In [15]:
fig = px.bar(df['Related Genres'].explode().value_counts().reset_index(),
             x='Related Genres',
             y='count',
             title='Number of Books by Related Genre (Inclusive)')
fig.update_layout(height=500, bargap=0.1)
fig.show()
In [16]:
fig = px.bar(df['Subgenres'].explode().value_counts().reset_index()[:20],
             x='Subgenres',
             y='count',
             title='Number of Books by Top 20 Subgenres')
fig.update_layout(height=500)
fig.show()
In [17]:
limited_books = df.groupby(["Genre", "Limited Stock"]
                           ).size().reset_index(name='count')

fig = px.bar(limited_books,
             x='Genre',
             y='count',
             color='Limited Stock',
             title='Genres by Limited Stock',
             barmode='group')
fig.update_layout(height=500)
fig.show()
In [18]:
top_authors = df["Author"].value_counts().reset_index()
top_authors
Out[18]:
Author count
0 Unassigned 25
1 Jeff Kinney 24
2 Hergé 23
3 Thich Nhat Hanh 16
4 Kentaro Miura 16
... ... ...
1337 Yoshitoki Oima 1
1338 Luo Di Cheng Qiu 1
1339 Jim Starlin 1
1340 Negi Haruba 1
1341 Peter Matthiessen 1

1342 rows × 2 columns

In [19]:
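# exclude the "Unassigned" placeholder, then keep the 10 most frequent authors
fig = px.bar(top_authors[top_authors["Author"] != "Unassigned"].head(10),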
             x='Author',
             y='count',
             title='Top Authors by Number of Books')
fig.update_xaxes(title_text="Authors")
fig.update_yaxes(title_text="Number of Books")
fig.show()
In [20]:
fig = px.bar(df.sort_values(by='Price', ascending=False)[:10], 
             x='Title', 
             y='Price', 
             title='Top 10 Most Expensive Books', 
             color='Title')
fig.update_layout(height=500)
fig.update_xaxes(showticklabels=False)
fig.show()

Pie charts¶

In [21]:
fig = px.pie(df["Language"].value_counts().reset_index(),
             values='count',
             names='Language',
             hole=0.4,
             title='Distribution of Books by Language')
fig.update_layout(height=600)
fig.show()
In [22]:
fig = px.pie(df["Genre"].value_counts().reset_index(),
             values='count',
             names='Genre',
             title='Distribution of Books by Genres',
             hole=0.4,
             color_discrete_sequence=px.colors.qualitative.Set1)
fig.update_layout(height=600)
fig.show()

Scatter Plots¶

In [23]:
fig = px.scatter(df,
                 x='Price',
                 y='Discount Amount',
                 title='Price vs. Discount Amount',
                 color='Discount',
                 hover_data=["Title", "Limited Stock", "Discount"])
fig.update_traces(marker=dict(size=5),
                  selector=dict(mode='markers'))
fig.show()
In [24]:
fig = px.scatter(df,
                 x='Weight',
                 y='Price',
                 log_y=True,
                 title='Price vs. Weight',
                 color='Limited Stock',
                 hover_data=["Limited Stock", "Genre"])
fig.update_traces(marker=dict(size=5, line=dict(width=0.4, color='DarkSlateGray')))
fig.show()
In [25]:
fig = px.scatter(no_outlier_df,
                 x='Weight',
                 y='Price',
                 log_y=True,
                 title='Price vs. Weight (No Outliers)',
                 color='Limited Stock',
                 hover_data=["Limited Stock", "Genre"])
fig.update_traces(marker=dict(size=5, line=dict(width=0.4, color='DarkSlateGray')))
fig.show()

Box Plots¶

In [26]:
fig = px.violin(df,
                x='Genre',
                y='Price',
                color="Genre",
                log_y=True,
                title='Price Distribution by Genre')
fig.update_layout(height=500, showlegend=False)
fig.show()
In [27]:
fig = px.box(df,
             x='Genre',
             y='Number of Pages',
             color='Genre',
             log_y=True,
             title="Page Count Distribution by Genre")
fig.update_layout(height=500, showlegend=False)
fig.show()

Bubble Chart¶

In [28]:
fig = px.scatter(avg_data,
                 x='Average Price',
                 y='Average Page Count',
                 size='Number of Books',
                 color='Genre',
                 title="Genre Analysis: Price vs. Page Count",
                 color_discrete_sequence=px.colors.qualitative.Set1)
fig.update_layout(height=600)
fig.update_xaxes(title_text="Average Price")
fig.update_yaxes(title_text="Average Page Count")
fig.update_traces(marker=dict(line=dict(width=1,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))
fig.show()
In [29]:
fig = px.scatter(avg_data, 
                   x='Number of Books',
                   y='Average Price',
                   color='Genre',
                   size='Average Price',
                   title='Book Count vs Price')
fig.update_layout(height=500)
fig.update_traces(marker=dict(line=dict(width=1,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))
fig.show()
In [30]:
fig = px.scatter_3d(avg_data,
                    x='Average Price',
                    y='Number of Books',
                    z='Average Page Count',
                    color='Genre',
                    title="3D Scatter Plot of Price, Book Count, and Page Count")
fig.update_layout(height=600)
fig.update_traces(marker=dict(size=4, 
                              line=dict(width=1,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))
fig.show()

Correlation¶

In [31]:
# pairwise correlations between the features; the heatmap reveals no notable relationships
correlation = df[['Price', 'Number of Pages',
                  'Weight', 'Limited Stock']].corr()

fig = px.imshow(correlation,
                text_auto=True,
                color_continuous_scale='Sunsetdark', 
                title='Correlation Heatmap')
fig.update_layout(
    title='Correlation Heatmap',
    xaxis_title='Features',
    yaxis_title='Features',
    height=600,
    width=800
)

fig.show()

Word Cloud¶

In [32]:
text = ' '.join(synopsis for synopsis in df["Synopsis"].dropna())

custom_stopwords = set(STOPWORDS)
custom_stopwords.update(["book", "author", "words", "common", "u"])
wordcloud = WordCloud(stopwords=custom_stopwords,
                      width=800,
                      height=400,
                      background_color='white',
                      colormap='viridis').generate(text)
# render the word cloud as an RGB image array
fig = px.imshow(wordcloud.to_array())
fig.update_layout(height=500)
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.show()

Modeling and Machine Learning¶

Books with an unknown author (listed as "Unassigned") heavily affect the recommendation system when authors are also taken into account, so they are dropped here.

In [7]:
df = df[df["Author"] != "Unassigned"]
In [8]:
df.columns
Out[8]:
Index(['Title', 'Author', 'Price', 'List Price', 'Discount Amount',
       'Limited Stock', 'Discount', 'Genre', 'Number of Pages', 'Weight',
       'ISBN', 'Language', 'Related Genres', 'Subgenres', 'Synopsis', 'URL'],
      dtype='object')
In [9]:
df.drop_duplicates(subset=["ISBN"], keep='last', inplace=True)

Genres and Subgenres¶

In [10]:
df[["Related Genres", "Subgenres"]]
Out[10]:
Related Genres Subgenres
0 [Arts And Photography, Self Improvement And Re... [Music, Self Help, Psychology]
1 [Arts And Photography, Nepali] [Picture Books, Books on Nepal]
2 [History Biography And Social Science, Arts An... [Biography, Memoir, Art]
3 [Arts And Photography, Learning And Reference] [Architecture, Science]
4 [History Biography And Social Science, Arts An... [History, Design, Art, Science]
... ... ...
1937 [History Biography And Social Science] [Memoir]
1938 [Travel, Nepali] [Climbing and Mountaineering, Books on Nepal]
1939 [Fiction And Literature] [Classics, Contemporary]
1940 [History Biography And Social Science, Busines... [Memoir, Biography, Business]
1941 [History Biography And Social Science] [Autobiography]

1913 rows × 2 columns

In [11]:
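unique_subgenres = []
max_subgenre_len = 0

for subgenres in df["Subgenres"]:
    # every entry should be a list after clean_and_convert_to_list; skip anything else
    if not isinstance(subgenres, list):
        continue
    max_subgenre_len = max(max_subgenre_len, len(subgenres))
    for subgenre in subgenres:
        # keep first-seen order while collecting unique subgenres
        if subgenre not in unique_subgenres:
            unique_subgenres.append(subgenre)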


print(unique_subgenres)
print(len(unique_subgenres))
print("Max Len: ", max_subgenre_len)
['Music', 'Self Help', 'Psychology', 'Picture Books', 'Books on Nepal', 'Biography', 'Memoir', 'Art', 'Architecture', 'Science', 'History', 'Design', 'Business', 'Stress Management', 'Philosophy', 'Ages 3 to 5', 'Coloring Books', 'Management', 'Leadership', 'Ages 6 to 8', 'Childrens', 'Action and Adventure', 'Fashion', 'Fantasy', 'Romance', 'Young Adult', 'Poetry and Prose', 'Autobiography', 'Mindfulness', 'Photography and Filmmaking', 'Science Fiction', 'Contemporary', 'Humor', 'Classics', 'Economics', 'Sociology', 'Ages 9 to 12', 'Card Games', 'Short Story', 'Buddhism', 'Productivity', 'Time Management', 'Finance', 'Biology', 'Investing', 'Feminism', 'Marketing and Sales', 'Politics', 'Money', 'Asian Literature', 'Communication and Social Skills', 'Mental Health', 'Japanese Literature', 'Adult Fiction', 'Drama', 'Military Fiction', 'Historical Fiction', 'Womens Fiction', 'LGBTQIA+', 'Mystery', 'Thriller and Suspense', 'Crime', 'Coming of Age', 'Horror', 'Chick lit', 'French', 'Hindi', 'Hinduism', 'Japanese', 'Russian', 'German', 'Osho', 'Chinese', 'Language', 'Linguistics and Writing', 'Language Books', 'Society and Culture', 'True Crime', 'Anthology', 'Medicine', 'Baby to 2', 'Teens and Young Adult', 'Children Activities and Crafts', 'Nepali Language', 'Nepali Children Book', 'Coloring for Children', 'Parenting and Relationships', 'Neuroscience', 'Dictionaries', 'Puzzles', 'Geography', 'Current Affairs', 'Mathematics', 'Motivational', 'Health', 'Food and Drinks', 'Football', 'Sports', 'Meditation and Yoga', 'Pregnancy and Childbirth', 'Card Decks and Oracles', 'Healing', 'Cookbooks', 'Diary', 'Journal', 'Quotes', 'Mythology', 'Tarot', 'Comics', 'Manga', 'Graphic Novels', 'Books on Tibet', 'Books of Bangladesh', 'Books on India', 'Environment', 'Trees and Plants', 'Encyclopedias', 'Animals and Pets', 'Gems and Jewelleries', 'Climbing and Mountaineering', 'Travel Guide Books', 'Astrology', 'Books On Himalayas', 'Nepali Literature', 'Paranormal', 'Islam', 'Modern Classic', 'Christianity', 'Anthropology', 'Sex', 'Computers and Internet', 'Artificial Intelligence', 'BlockChain Technology', 'Programming', 'Engineering', 'Law', 'Journalism', 'Atlas', 'British Literature']
139
Max Len:  8
In [12]:
max_relgenres_len = 0
for genres in df["Related Genres"]:
    if len(genres) > max_relgenres_len:
        max_relgenres_len = len(genres)

print("Max Related Genres Length: ", max_relgenres_len)
Max Related Genres Length:  5
In [13]:
df["Related Genres"].explode().value_counts()
Out[13]:
Related Genres
History Biography And Social Science    546
Fiction And Literature                  411
Self Improvement And Relationships      309
Spirituality And Philosophy             268
Business And Investing                  261
Learning And Reference                  226
Kids And Teens                          175
Arts And Photography                    160
Manga And Graphic Novels                154
Nepali                                  152
Lifestyle And Wellness                  142
Religion                                135
Miscellaneous                           113
Nature                                  112
Travel                                   90
Technology                               71
Foreign Languages                        46
Rare Coffee Table Books                   5
Name: count, dtype: int64
In [14]:
df["Subgenres"].explode().value_counts()
Out[14]:
Subgenres
Self Help               279
Philosophy              239
Business                236
History                 199
Science                 188
                       ... 
Geography                 1
Astrology                 1
Gems and Jewelleries      1
Encyclopedias             1
British Literature        1
Name: count, Length: 139, dtype: int64

Encoding¶

In [15]:
mlb_related = MultiLabelBinarizer()
related_genres_encoded = mlb_related.fit_transform(df["Related Genres"])
In [16]:
mlb_subgenres = MultiLabelBinarizer()
subgenres_encoded = mlb_subgenres.fit_transform(df["Subgenres"])
In [17]:
display(pd.DataFrame(
    related_genres_encoded, columns=mlb_related.classes_))
Arts And Photography Business And Investing Fiction And Literature Foreign Languages History Biography And Social Science Kids And Teens Learning And Reference Lifestyle And Wellness Manga And Graphic Novels Miscellaneous Nature Nepali Rare Coffee Table Books Religion Self Improvement And Relationships Spirituality And Philosophy Technology Travel
0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0
1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
2 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
4 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1908 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1909 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1
1910 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1911 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1912 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

1913 rows × 18 columns

In [18]:
display(pd.DataFrame(subgenres_encoded, columns=mlb_subgenres.classes_))
Action and Adventure Adult Fiction Ages 3 to 5 Ages 6 to 8 Ages 9 to 12 Animals and Pets Anthology Anthropology Architecture Art ... Stress Management Tarot Teens and Young Adult Thriller and Suspense Time Management Travel Guide Books Trees and Plants True Crime Womens Fiction Young Adult
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1908 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1909 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1910 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1911 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1912 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

1913 rows × 139 columns
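
To make the multi-hot encoding concrete, here is a small sketch of `MultiLabelBinarizer` on made-up label lists (the three example books are hypothetical):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists, one per book
labels = [["Humor", "Ages 9 to 12"], ["Classics"], ["Classics", "Humor"]]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(labels)

# classes_ holds the column order (sorted label names)
print(mlb.classes_)
# each row has a 1 in every column whose label applies to that book
print(encoded)
```

Each book becomes one row and each distinct label one column, which is how the 18-column and 139-column matrices above arise.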

Encoding¶

Using paraphrase-multilingual-MiniLM-L12-v2 from SentenceTransformers to encode synopses.

In [19]:
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
In [20]:
df.reset_index(inplace=True, drop=True)
synopsis_embeddings = model.encode(df["Synopsis"], show_progress_bar=True)

Combine and Add Weights¶

In [21]:
weights = {
    "genre_weight": 0.2,  # related genres weight
    "subgenre_weight": 0.1,
    "synopsis_weight": 0.4
}

synopsis_embeddings_matrix = np.vstack(synopsis_embeddings)

weighted_related_genres = weights["genre_weight"] * related_genres_encoded
weighted_subgenres = weights["subgenre_weight"] * subgenres_encoded
weighted_synopses = weights["synopsis_weight"] * synopsis_embeddings_matrix

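# scaling the synopsis embeddings (dense matrix)
# NOTE: with_mean=False still divides every column by its standard deviation,
# so the synopsis_weight factor applied above cancels out in the scaled result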
scaler = StandardScaler(with_mean=False)
synopsis_embeddings_scaled = scaler.fit_transform(weighted_synopses)

combined_weighted_features = hstack([
    csr_matrix(weighted_related_genres),
    csr_matrix(weighted_subgenres),
    csr_matrix(synopsis_embeddings_scaled),
])
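
One detail worth checking in the cell above is the order of weighting and scaling. The sketch below uses random data as a stand-in for the synopsis embeddings (an assumption for illustration only) and suggests that multiplying by a constant before `StandardScaler` has no effect, whereas multiplying afterwards does:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 4))   # stand-in for synopsis embeddings
w = 0.4                          # synopsis weight

# weight first, then scale: dividing by each column's std removes the factor w
weight_then_scale = StandardScaler(with_mean=False).fit_transform(w * emb)
scale_only = StandardScaler(with_mean=False).fit_transform(emb)

# scale first, then weight: the factor survives
scale_then_weight = w * scale_only

print(np.allclose(weight_then_scale, scale_only))   # weight had no effect
print(np.allclose(scale_then_weight, scale_only))   # weight changed the values
```

If the intent is for the synopsis block to carry a relative weight of 0.4, scaling first and weighting second is one way to get there; doing so would change the similarity values reported below.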
In [22]:
print(combined_weighted_features)
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 742520 stored elements and shape (1913, 541)>
  Coords	Values
  (0, 0)	0.2
  (0, 4)	0.2
  (0, 14)	0.2
  (0, 116)	0.1
  (0, 134)	0.1
  (0, 141)	0.1
  (0, 157)	1.784592866897583
  (0, 158)	-0.4816299080848694
  (0, 159)	-0.26151272654533386
  (0, 160)	-0.4406612813472748
  (0, 161)	-1.638753890991211
  (0, 162)	0.08578472584486008
  (0, 163)	1.0713781118392944
  (0, 164)	0.6499630808830261
  (0, 165)	0.9495714902877808
  (0, 166)	0.268752783536911
  (0, 167)	0.04838496446609497
  (0, 168)	0.03733789920806885
  (0, 169)	0.5523092150688171
  (0, 170)	-2.791269063949585
  (0, 171)	1.151781678199768
  (0, 172)	0.28092601895332336
  (0, 173)	0.03787020221352577
  (0, 174)	1.4218069314956665
  (0, 175)	-1.508204460144043
  :	:
  (1912, 516)	-0.4690714180469513
  (1912, 517)	1.0671030282974243
  (1912, 518)	0.30465206503868103
  (1912, 519)	1.0742019414901733
  (1912, 520)	2.244892120361328
  (1912, 521)	0.0510968416929245
  (1912, 522)	-8.167990017682314e-05
  (1912, 523)	-0.5710775256156921
  (1912, 524)	-0.15032102167606354
  (1912, 525)	0.027843547984957695
  (1912, 526)	2.2979421615600586
  (1912, 527)	-0.9206590056419373
  (1912, 528)	1.8066314458847046
  (1912, 529)	0.3503602147102356
  (1912, 530)	-2.2840702533721924
  (1912, 531)	-0.3761787712574005
  (1912, 532)	-0.22271381318569183
  (1912, 533)	0.24022667109966278
  (1912, 534)	-1.2844871282577515
  (1912, 535)	-2.121735095977783
  (1912, 536)	0.15947629511356354
  (1912, 537)	-0.06682594120502472
  (1912, 538)	-0.6463495492935181
  (1912, 539)	-2.1301190853118896
  (1912, 540)	1.453000545501709

Calculate Similarity¶

Cosine Similarity¶

Cosine similarity is the cosine of the angle between two non-zero vectors, so it compares their direction and ignores their magnitude. It ranges from -1 (vectors pointing in opposite directions) to 1 (vectors pointing in the same direction), with 0 indicating orthogonal vectors (no similarity).

$$ \text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} $$

where,

  • $A \cdot B$ is the dot product of two vectors, $A$ and $B$
  • $\|A\|$ and $\|B\|$ are the magnitudes (Euclidean norms) of the vectors $A$ and $B$
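
A small worked example of the formula, using three made-up vectors and a hypothetical helper `cosine`, compared against scikit-learn's `cosine_similarity`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 4.0, 6.0])      # same direction as A, twice the length
C = np.array([-1.0, -2.0, -3.0])   # opposite direction to A

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # dot product divided by the product of the Euclidean norms
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

manual_ab = cosine(A, B)
manual_ac = cosine(A, C)

# sklearn computes all pairwise similarities between the rows at once
sk = cosine_similarity(np.vstack([A, B, C]))
print(manual_ab, manual_ac)
```

`A` and `B` differ only in length, so their similarity is 1 up to floating-point error; `A` and `C` point in opposite directions, giving -1.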
In [23]:
similarity_matrix = cosine_similarity(combined_weighted_features)
similarity_df = pd.DataFrame(
    similarity_matrix, index=df['Title'], columns=df['Title'])
In [24]:
similarity_df.head()
Out[24]:
Title The Inner Game of Music The Nepalis a pictorial celebration Lust for Life The Architecture Book The World According to Colour Design Your Thinking Wabi Sabi : The Wisdom In Imperfection The Art Book Colouring book : Copy Colour Fruits and Vegetables Creativity, Inc.: Overcoming the Unseen Forces That Stand in the Way of True Inspiration ... The Time Keeper The Climb The Life and Times of the Thunderbolt Kid Storms of Silence Annapurna A glimpse of eternal snows Tourism in Pokhara Satori in Paris Leaving Microsoft to Change the World The Snow Leopard
Title
The Inner Game of Music 1.000000 0.034147 0.176279 0.201509 0.278708 0.300259 0.230279 0.251498 0.125580 0.124834 ... 0.098426 0.137469 0.144421 0.244405 0.050204 0.138618 0.170119 0.139562 0.183268 0.137902
The Nepalis a pictorial celebration 0.034147 1.000000 0.118702 0.082047 0.301799 0.136164 0.195004 0.358356 0.129706 0.165313 ... 0.088339 0.186285 0.170930 0.075160 0.436631 0.332246 0.399275 0.275086 0.219800 0.233057
Lust for Life 0.176279 0.118702 1.000000 0.264988 0.326018 0.110824 0.163250 0.371939 0.253391 0.257954 ... 0.250244 0.404889 0.427070 0.278486 0.144082 0.252875 0.325080 0.380746 0.283003 0.205666
The Architecture Book 0.201509 0.082047 0.264988 1.000000 0.318694 0.245561 0.203262 0.488973 0.211905 0.159666 ... 0.179159 0.259368 0.222482 0.132534 0.077794 0.236956 0.293476 0.221182 0.191667 0.180547
The World According to Colour 0.278708 0.301799 0.326018 0.318694 1.000000 0.211711 0.236022 0.417661 0.487917 0.188843 ... 0.220100 0.126099 0.304179 0.184394 0.093187 0.327152 0.314607 0.312270 0.206742 0.293267

5 rows × 1913 columns

In [25]:
type(similarity_matrix)
Out[25]:
numpy.ndarray

Spectral Clustering¶

In [26]:
db_scores = []
cluster_range = range(2, 20) 

for n_clusters in cluster_range:
    spectral = SpectralClustering(n_clusters, affinity='precomputed', random_state=42)
    labels = spectral.fit_predict(similarity_matrix)
    score = davies_bouldin_score(similarity_matrix, labels)
    db_scores.append(score)

print(db_scores)
[2.310745480641241, 2.6262955249402484, 2.0825395549022927, 2.3150347844418984, 2.131239058192411, 2.238069358487908, 2.2495851492530963, 2.3740701938440525, 2.4610600540217056, 2.5060761997933674, 2.517114030275109, 2.6979313709881563, 2.3956730612528703, 2.411985376934093, 2.8241969883224582, 2.538603845476868, 2.405740148752897, 2.542586944219207]
In [27]:
fig = px.line(x=cluster_range, 
              y=db_scores,
              title="Davies-Bouldin Index by Number of Clusters")
fig.update_layout(xaxis_title="Number of Clusters", 
                  yaxis_title="Davies Bouldin Score")
fig.show()
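
Lower Davies-Bouldin values indicate tighter, better separated clusters, so the cluster count can be read off as the minimum of the sweep. A small sketch using the scores printed above, rounded to three decimals (`best_n_clusters` is a name introduced here for illustration):

```python
import numpy as np

cluster_range = range(2, 20)
# Davies-Bouldin scores from the sweep, rounded to three decimals
db_scores = [2.311, 2.626, 2.083, 2.315, 2.131, 2.238, 2.250, 2.374, 2.461,
             2.506, 2.517, 2.698, 2.396, 2.412, 2.824, 2.539, 2.406, 2.543]

# position of the lowest score, mapped back to the cluster count it belongs to
best_n_clusters = list(cluster_range)[int(np.argmin(db_scores))]
print(best_n_clusters)
```

The minimum sits at the third entry of the sweep, which corresponds to 4 clusters.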
In [28]:
# 4 clusters gave the lowest Davies-Bouldin score (about 2.08) in the sweep above
n_clusters = 4
spectral_clustering = SpectralClustering(n_clusters=n_clusters, 
                                         affinity='precomputed', 
                                         assign_labels='kmeans', 
                                         random_state=42)
labels = spectral_clustering.fit_predict(similarity_matrix)
In [29]:
import umap

umap_reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42, metric='cosine')
embedding = umap_reducer.fit_transform(similarity_matrix)

fig = px.scatter(x=embedding[:, 0], 
                 y=embedding[:, 1],
                 color=labels.astype(str),
                 title='Spectral Clustering of Books',
                 labels={'color': 'Clusters'})
fig.update_layout(xaxis_title="UMAP 1", 
                  yaxis_title="UMAP 2",
                  height=500)
fig.show()
In [30]:
df["Spectral Cluster"] = labels
In [31]:
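def get_recommendations(similarity_df: pd.DataFrame,
                        df: pd.DataFrame,
                        title: str,
                        n: int = 5,
                        columns: tuple[str, ...] = (
                            "Title", "Author", "Genre", "Related Genres", "Subgenres", "Spectral Cluster")
                        ) -> pd.DataFrame:
    # tuple default avoids sharing one mutable list between calls
    columns = list(columns)

    # positional index of the first book with this title; titles can repeat,
    # in which case index.get_loc would return a slice or mask instead of an int
    matches = np.flatnonzero(similarity_df.index == title)
    if len(matches) == 0:
        raise KeyError(title)
    idx = int(matches[0])
    sim_scores = list(enumerate(similarity_df.iloc[idx]))

    # sort the books by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # drop the query book itself, then keep the top n
    sim_scores = [s for s in sim_scores if s[0] != idx][:n]

    # get the book indices
    book_indices = [i[0] for i in sim_scores]
    display(df[df["Title"] == title][columns])
    # top n most similar books
    return df[columns].iloc[book_indices]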
In [32]:
get_recommendations(similarity_df,
                    df,
                    "Diary of a Wimpy Kid",
                    n=10)
Title Author Genre Related Genres Subgenres Spectral Cluster
382 Diary of a Wimpy Kid Jeff Kinney Fiction And Literature [Fiction And Literature, Kids And Teens] [Humor, Ages 9 to 12] 0
Out[32]:
Title Author Genre Related Genres Subgenres Spectral Cluster
339 Diary Of A Wimpy Kid ; The Ugly Truth Jeff Kinney Kids And Teens [Kids And Teens] [Ages 9 to 12] 0
781 Diary Of A Wimpy Kid: No Brainer Jeff Kinney Kids And Teens [Kids And Teens, Fiction And Literature, Manga... [Childrens, Humor, Graphic Novels] 0
360 Diary of a Wimpy Kid: Wrecking Ball Jeff Kinney Kids And Teens [Kids And Teens] [Ages 9 to 12] 0
353 Cabin Fever Jeff Kinney Kids And Teens [Kids And Teens] [Ages 9 to 12] 0
735 Diary of an Awesome Friendly Kid: Rowley Jeffe... Jeff Kinney Fiction And Literature [Fiction And Literature, Kids And Teens, Manga... [Humor, Young Adult, Childrens, Graphic Novels... 0
656 DIARY OF A WIMPY KID: THE GETAWAY Jeff Kinney Fiction And Literature [Fiction And Literature, Kids And Teens, Manga... [Humor, Childrens, Graphic Novels] 0
357 Rodrick Rules Jeff Kinney Kids And Teens [Kids And Teens] [Ages 6 to 8, Ages 9 to 12] 0
431 Diary Of A Wimpy Kid ; Hard Luck Jeff Kinney Kids And Teens [Kids And Teens] [Ages 9 to 12] 0
369 Diary of a Wimpy Kid 10. Old School Jeff Kinney Fiction And Literature [Fiction And Literature, Kids And Teens] [Young Adult, Humor, Ages 9 to 12] 0
424 Diary of a Wimpy Kid: Diper Overlode (Book 17) Jeff Kinney Kids And Teens [Kids And Teens] [Ages 9 to 12] 0
In [33]:
get_recommendations(similarity_df, df, "Satori in Paris")
Title Author Genre Related Genres Subgenres Spectral Cluster
1910 Satori in Paris Jack Kerouac Fiction And Literature [Fiction And Literature] [Classics, Contemporary] 3
Out[33]:
Title Author Genre Related Genres Subgenres Spectral Cluster
1870 Lonesome Traveler Jack Kerouac Fiction And Literature [Fiction And Literature] [Classics] 0
1602 Metamorphosis and Other Stories Franz Kafka Fiction And Literature [Fiction And Literature, Spirituality And Phil... [Short Story, Classics, Philosophy] 0
1553 Metamorphosis and Other Stories Franz Kafka Fiction And Literature [Fiction And Literature, Spirituality And Phil... [Modern Classic, Philosophy] 0
1149 Ijajatpatra Sarthak Karki Nepali [Nepali] [Nepali Literature] 0
271 Sakshi Chetna: Amrita Pritam Rajesh Chandra History Biography And Social Science [History Biography And Social Science, Foreign... [Memoir, Hindi] 3

K-Nearest Neighbors¶

Model¶

In [34]:
k = 15  # number of recommendations (neighbors)

# KNN using cosine distance
knn = NearestNeighbors(n_neighbors=k, metric='cosine', algorithm='brute')
knn.fit(combined_weighted_features)
Out[34]:
NearestNeighbors(algorithm='brute', metric='cosine', n_neighbors=15)
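For reading the Distance column in the outputs below: scikit-learn's cosine metric is 1 minus the cosine similarity, so 0 means two vectors point in the same direction and 1 means they are orthogonal. A minimal pure-Python sketch on made-up vectors (the helper name cosine_distance is illustrative and not part of the notebook):

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two dense vectors: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# toy vectors, not taken from the book features
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # no shared direction

print(same_direction, orthogonal)
```

Because only the angle matters, scaling a vector does not change its distance to any other vector.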
In [35]:
def recommend_knn(book_title,
                  df,
                  knn_model,
                  combined_features_scaled,
                  n_recommendations=5,
                  columns=['Title', 'Author', 'Related Genres', 'Subgenres', 'Spectral Cluster'], 
                  verbose=True):
    try:
        # first row with a matching title; the index label is reused below as a
        # row position, which relies on df having a default RangeIndex
        book_idx = df[df['Title'] == book_title].index[0]
    except IndexError:
        return f'Book "{book_title}" not found.'

    # get the feature vector for the book
    book_vector = combined_features_scaled[book_idx].reshape(1, -1)

    # ask for one extra neighbor, since the query book is in the fitted data
    distances, indices = knn_model.kneighbors(
        book_vector, n_neighbors=n_recommendations+1)

    # drop the first hit, assumed to be the query book itself (distance 0)
    recommended_indices = indices[0][1:]
    recommended_distances = distances[0][1:]

    # create a df with recommended books and distances
    recommendations_df = df.iloc[recommended_indices].copy()
    recommendations_df['Distance'] = recommended_distances

    if verbose:
        display(df[df["Title"] == book_title][columns])
        
    return recommendations_df[columns + ["Distance"]]
In [36]:
recommend_knn("Diary of a Wimpy Kid", df, knn, combined_weighted_features)
Title Author Related Genres Subgenres Spectral Cluster
382 Diary of a Wimpy Kid Jeff Kinney [Fiction And Literature, Kids And Teens] [Humor, Ages 9 to 12] 0
Out[36]:
Title Author Related Genres Subgenres Spectral Cluster Distance
339 Diary Of A Wimpy Kid ; The Ugly Truth Jeff Kinney [Kids And Teens] [Ages 9 to 12] 0 0.315828
781 Diary Of A Wimpy Kid: No Brainer Jeff Kinney [Kids And Teens, Fiction And Literature, Manga... [Childrens, Humor, Graphic Novels] 0 0.331757
360 Diary of a Wimpy Kid: Wrecking Ball Jeff Kinney [Kids And Teens] [Ages 9 to 12] 0 0.352462
353 Cabin Fever Jeff Kinney [Kids And Teens] [Ages 9 to 12] 0 0.363305
735 Diary of an Awesome Friendly Kid: Rowley Jeffe... Jeff Kinney [Fiction And Literature, Kids And Teens, Manga... [Humor, Young Adult, Childrens, Graphic Novels... 0 0.383408
In [37]:
recommend_knn("Jay Vudi", df, knn, combined_weighted_features, n_recommendations=10)
Title Author Related Genres Subgenres Spectral Cluster
1151 Jay Vudi Bhairav Aryal [Nepali] [Nepali Literature] 3
Out[37]:
Title Author Related Genres Subgenres Spectral Cluster Distance
1140 Damini Bhir Rajan Mukarung [Nepali] [Nepali Literature, Nepali Language] 3 0.389024
1093 Ghatmandu Kumar Nagarkoti [Nepali] [Nepali Literature, Nepali Language] 3 0.399716
1149 Ijajatpatra Sarthak Karki [Nepali] [Nepali Literature] 0 0.402148
1867 Mountains painted with turneric Lil Bahadur Chettri [Travel] [Climbing and Mountaineering] 3 0.416109
195 Arresting god in kathmandu Samrat Upadhyay [Fiction And Literature] [Short Story, Contemporary] 3 0.422693
1084 Lato Pahad Upendra Subba [Nepali] [Nepali Literature] 3 0.424879
224 Karnali Blues Buddhisagar, Michael Hutt (Translator) [Fiction And Literature] [Asian Literature, Contemporary] 3 0.430533
524 Ratna's basic Nepali dictionary Shyam P. Wagley, Bijay Kumar Rauniyar [Learning And Reference] [Dictionaries] 3 0.431791
1122 Kumari Prashnaharu Durga Karki [Nepali] [Nepali Literature] 3 0.459874
1139 Nepalese Folklore: Kirati Tales Shiva Kumar Sheratha [Nepali] [Books on Nepal] 3 0.474343
In [38]:
recommend_knn("Satori in Paris", df, knn, combined_weighted_features, n_recommendations=15)
Title Author Related Genres Subgenres Spectral Cluster
1910 Satori in Paris Jack Kerouac [Fiction And Literature] [Classics, Contemporary] 3
Out[38]:
Title Author Related Genres Subgenres Spectral Cluster Distance
1870 Lonesome Traveler Jack Kerouac [Fiction And Literature] [Classics] 0 0.398846
1602 Metamorphosis and Other Stories Franz Kafka [Fiction And Literature, Spirituality And Phil... [Short Story, Classics, Philosophy] 0 0.425701
1553 Metamorphosis and Other Stories Franz Kafka [Fiction And Literature, Spirituality And Phil... [Modern Classic, Philosophy] 0 0.425715
1149 Ijajatpatra Sarthak Karki [Nepali] [Nepali Literature] 0 0.448904
271 Sakshi Chetna: Amrita Pritam Rajesh Chandra [History Biography And Social Science, Foreign... [Memoir, Hindi] 3 0.466396
821 India My Love Dominique Lapierre [Miscellaneous] [Books on India] 3 0.479485
1493 Dharmayoddha Kalki Kevin Missal [Fiction And Literature, Spirituality And Phil... [Fantasy, Mythology] 3 0.485310
855 First Person Singular Haruki Murakami [Miscellaneous] [] 0 0.486236
1128 Nun Tel Jeevan Chhetri [Nepali] [Nepali Literature, Nepali Language] 3 0.488469
1151 Jay Vudi Bhairav Aryal [Nepali] [Nepali Literature] 3 0.489157
1825 The Two-Year Mountain Phil Deutschle [Travel] [Travel Guide Books, Climbing and Mountaineering] 3 0.494712
1819 The Journey Home: Autobiography of an American... Radhanath Swami [Spirituality And Philosophy, History Biograph... [Philosophy, Biography, Memoir, Autobiography,... 3 0.500038
1142 Loo Nayan Raj Pandey [Nepali] [Nepali Literature] 3 0.502328
282 Alchemist (Hindi) Paul Cornell [Fiction And Literature, Foreign Languages] [Fantasy, Hindi] 0 0.506093
1120 Aja Ramita Chha Indra Bahadur Rai [Nepali] [Nepali Literature] 3 0.506831

Recommendation Visualizations¶

In [39]:
recommendations = recommend_knn("The Bell Jar", 
                                df, 
                                knn, 
                                combined_weighted_features, 
                                n_recommendations=20, 
                                verbose=False)
print(recommendations.columns)
Index(['Title', 'Author', 'Related Genres', 'Subgenres', 'Spectral Cluster',
       'Distance'],
      dtype='object')

Distribution¶

In [40]:
rec_genres = recommendations["Related Genres"].explode().value_counts().reset_index()
rec_subgenres = recommendations["Subgenres"].explode().value_counts().reset_index()

genre_dist_plot = make_subplots(cols=1, rows=2, 
                                subplot_titles=('Distribution of Recommendations by Genre', 
                                                'Distribution of Recommendations by Subgenres'), 
                                specs=[[{'type': 'pie'}], [{'type': 'pie'}]])

genre_dist_plot.add_trace(go.Pie(values=rec_genres['count'], 
                                 labels=rec_genres['Related Genres'], 
                                 name='Related Genres',
                                 hole=0.4,
                                 legendgroup='genre',
                                 showlegend=True), 
                          row=1, 
                          col=1)
genre_dist_plot.add_trace(go.Pie(values=rec_subgenres['count'], 
                                 labels=rec_subgenres['Subgenres'], 
                                 name='Subgenres',
                                 hole=0.4,
                                 legendgroup='subgenre', 
                                 showlegend=True), 
                          row=2, 
                          col=1)

genre_dist_plot.update_layout(
    height=800, 
    title_text="Recommendations Genre and Subgenre Distribution - 'The Bell Jar'",
)
genre_dist_plot.update_traces(
    legendgroup="genre", 
    showlegend=True, 
    row=1, col=1
)

genre_dist_plot.update_traces(
    legendgroup="subgenre", 
    showlegend=True, 
    row=2, col=1
)
genre_dist_plot.show()

Distance¶

In [41]:
book_title = "The Bell Jar"

umap_model = umap.UMAP(n_neighbors=15, random_state=42)
umap_embeddings = umap_model.fit_transform(similarity_matrix)

recommended_idx = recommendations.index

# filter the UMAP embeddings to only include recommended books
umap_recommendations = umap_embeddings[recommended_idx]

umap_df = pd.DataFrame(umap_recommendations, columns=["UMAP1", "UMAP2"])
umap_df['Title'] = recommendations['Title'].values

# index of original book
original_book_idx = df[df['Title'] == book_title].index[0]

original_book_umap = umap_embeddings[original_book_idx].reshape(1, -1)

original_book_df = pd.DataFrame(original_book_umap, columns=["UMAP1", "UMAP2"])
original_book_df['Title'] = book_title

umap_df_combined = pd.concat([umap_df, original_book_df], ignore_index=True)

fig = go.Figure()

# Add lines from the original book to every recommended book
for idx, row in umap_df.iterrows():
    fig.add_trace(go.Scatter(x=[original_book_df['UMAP1'].values[0], row['UMAP1']], 
                             y=[original_book_df['UMAP2'].values[0], row['UMAP2']], 
                             mode='lines',
                             line=dict(color='gray', width=0.5),
                             showlegend=False))

# Add the recommended books
fig.add_trace(go.Scatter(x=umap_df['UMAP1'], 
                         y=umap_df['UMAP2'], 
                         mode='markers', 
                         text=umap_df['Title'], 
                         name='Recommended',
                         marker=dict(color='#1F77B4', line=dict(width=1, color='DarkSlateGray'))))

# Add the original book trace
fig.add_trace(go.Scatter(x=original_book_df['UMAP1'], 
                         y=original_book_df['UMAP2'], 
                         mode='markers+text', 
                         text=original_book_df['Title'], 
                         name='Original Book',
                         marker=dict(color='red', size=12, symbol='x')))

fig.update_layout(title="UMAP Projection of Book Recommendations with Original Book", height=500)
fig.show()

Wrapping It All in a Function¶

In [56]:
def visualize_recommendations(book_title: str, n_recommendations: int = 10) -> None:
    recommendations = recommend_knn(book_title, 
                                df, 
                                knn, 
                                combined_weighted_features, 
                                n_recommendations, 
                                verbose=False)

    # Pie Charts
    rec_genres = recommendations["Related Genres"].explode().value_counts().reset_index()
    rec_subgenres = recommendations["Subgenres"].explode().value_counts().reset_index()
    
    genre_dist_plot = make_subplots(cols=1, rows=2, 
                                    subplot_titles=('Distribution of Recommendations by Genre', 
                                                    'Distribution of Recommendations by Subgenres'), 
                                    specs=[[{'type': 'pie'}], [{'type': 'pie'}]])
    
    genre_dist_plot.add_trace(go.Pie(values=rec_genres['count'], 
                                     labels=rec_genres['Related Genres'], 
                                     name='Related Genres',
                                     hole=0.4,
                                     legendgroup='genre',
                                     showlegend=True), 
                              row=1, 
                              col=1)
    genre_dist_plot.add_trace(go.Pie(values=rec_subgenres['count'], 
                                     labels=rec_subgenres['Subgenres'], 
                                     name='Subgenres',
                                     hole=0.4,
                                     legendgroup='subgenre', 
                                     showlegend=True), 
                              row=2, 
                              col=1)
    
    genre_dist_plot.update_layout(
        height=800, 
        title_text=f"Recommendations Genre and Subgenre Distribution - {book_title}",
    )
    genre_dist_plot.update_traces(
        legendgroup="genre", 
        showlegend=True, 
        row=1, col=1
    )
    
    genre_dist_plot.update_traces(
        legendgroup="subgenre", 
        showlegend=True, 
        row=2, col=1
    )
    genre_dist_plot.show()

    # Cluster Distance
    umap_model = umap.UMAP(n_neighbors=n_recommendations, min_dist=0.1, random_state=42, metric='cosine')
    umap_embeddings = umap_model.fit_transform(similarity_matrix)
    
    recommended_idx = recommendations.index
    
    # filter the UMAP embeddings to only include recommended books
    umap_recommendations = umap_embeddings[recommended_idx]
    
    umap_df = pd.DataFrame(umap_recommendations, columns=["UMAP1", "UMAP2"])
    umap_df['Title'] = recommendations['Title'].values
    
    # index of original book
    original_book_idx = df[df['Title'] == book_title].index[0]
    
    original_book_umap = umap_embeddings[original_book_idx].reshape(1, -1)
    
    original_book_df = pd.DataFrame(original_book_umap, columns=["UMAP1", "UMAP2"])
    original_book_df['Title'] = book_title
    
    umap_df_combined = pd.concat([umap_df, original_book_df], ignore_index=True)
    
    fig = go.Figure()

    # add lines from the original book to every recommended book
    for idx, row in umap_df.iterrows():
        fig.add_trace(go.Scatter(x=[original_book_df['UMAP1'].values[0], row['UMAP1']], 
                                 y=[original_book_df['UMAP2'].values[0], row['UMAP2']], 
                                 mode='lines',
                                 line=dict(color='gray', width=0.5),
                                 showlegend=False))
    
    # Add the recommended books
    fig.add_trace(go.Scatter(x=umap_df['UMAP1'], 
                             y=umap_df['UMAP2'], 
                             mode='markers', 
                             text=umap_df['Title'], 
                             name='Recommended',
                             marker=dict(color='#1F77B4', line=dict(width=1, color='DarkSlateGray'))))
    
    # Add the original book trace
    fig.add_trace(go.Scatter(x=original_book_df['UMAP1'], 
                             y=original_book_df['UMAP2'], 
                             mode='markers+text', 
                             text=original_book_df['Title'], 
                             name='Original Book',
                             marker=dict(color='red', size=12, symbol='x')))
    
    fig.update_layout(title="UMAP Projection of Book Recommendations with Original Book", height=500)
    fig.show()
In [57]:
visualize_recommendations(book_title="Diary of a Wimpy Kid")

Validation¶

Genre Diversity¶

The genre diversity of a recommendation list can be measured with Simpson's Diversity Index.

Simpson's Diversity Index (SDI) accounts for both the number of categories (e.g., genres) and the relative abundance of each category.

$$ D = 1 - \sum_i p_i^2 $$

where,

  • $D$ is Simpson's Diversity Index, ranging from 0 to 1 (values closer to 1 indicate higher diversity)
  • $p_i$ is the proportion of category $i$ relative to the total
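As a small worked example of the formula, the sketch below computes the index for a made-up list of genre labels. The helper name simpson_diversity and the toy labels are assumptions for illustration only; they are not part of the dataset or of the notebook's own function.

```python
from collections import Counter


def simpson_diversity(labels):
    """Simpson's Diversity Index, D = 1 - sum(p_i^2), for a flat list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


# toy labels: proportions are 0.5, 0.25 and 0.25
toy_genres = ["Fiction", "Fiction", "Travel", "Nepali"]
toy_score = simpson_diversity(toy_genres)

# D = 1 - (0.5^2 + 0.25^2 + 0.25^2) = 1 - 0.375 = 0.625
print(toy_score)
```

With k equally frequent categories the index reaches its maximum of 1 - 1/k, so it only approaches 1 when the list spans many categories.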
In [44]:
def diversity_score(recommendations, feature="Related Genres"):
    feature_counts = recommendations[feature].explode().value_counts()
    
    feature_proportions = feature_counts / feature_counts.sum()
    display(feature_proportions)
    
    # Simpson's Diversity Index
    diversity = 1 - (feature_proportions ** 2).sum()
    return diversity
In [45]:
diversity_score(
    recommend_knn("The Bell Jar", df, knn, combined_weighted_features, n_recommendations=20)
)
Title Author Related Genres Subgenres Spectral Cluster
307 The Bell Jar Sylvia Plath [Fiction And Literature, History Biography And... [Poetry and Prose, Classics, Psychology, Femin... 0
316 The Bell Jar Sylvia Plath [Fiction And Literature, History Biography And... [Classics, Psychology, Feminism] 0
Related Genres
Fiction And Literature                  0.484848
History Biography And Social Science    0.303030
Self Improvement And Relationships      0.090909
Spirituality And Philosophy             0.060606
Arts And Photography                    0.030303
Kids And Teens                          0.030303
Name: count, dtype: float64
Out[45]:
0.6593204775022956
In [46]:
diversity_score(
    recommend_knn("Jay Vudi", df, knn, combined_weighted_features, n_recommendations=20)
)
Title Author Related Genres Subgenres Spectral Cluster
1151 Jay Vudi Bhairav Aryal [Nepali] [Nepali Literature] 3
Related Genres
Nepali                                  0.458333
Fiction And Literature                  0.208333
Spirituality And Philosophy             0.125000
Religion                                0.083333
Travel                                  0.041667
Learning And Reference                  0.041667
History Biography And Social Science    0.041667
Name: count, dtype: float64
Out[46]:
0.71875
In [47]:
diversity_score(
    recommend_knn("Beautiful World, Where Are You", 
                  df, 
                  knn, 
                  combined_weighted_features, 
                  n_recommendations=20)
)
Title Author Related Genres Subgenres Spectral Cluster
240 Beautiful World, Where Are You Sally Rooney [Fiction And Literature] [Contemporary, Romance] 0
Related Genres
Fiction And Literature                  0.517241
History Biography And Social Science    0.137931
Kids And Teens                          0.103448
Foreign Languages                       0.068966
Spirituality And Philosophy             0.034483
Nature                                  0.034483
Self Improvement And Relationships      0.034483
Manga And Graphic Novels                0.034483
Lifestyle And Wellness                  0.034483
Name: count, dtype: float64
Out[47]:
0.6920332936979786

Intra-List Similarity¶

In [48]:
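def intra_list_similarity(recommendations, embedding_matrix):
    """Mean pairwise cosine similarity, pooled over all recommendation lists.

    recommendations: list of lists of row indices into embedding_matrix (sparse).
    """
    total_similarity = 0
    count = 0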

    for rec_indices in recommendations:
        if len(rec_indices) < 2:
            continue
        
        list_embeddings = embedding_matrix[rec_indices].toarray()

        # calculate cosine similarities within the list
        similarities = cosine_similarity(list_embeddings)
        
        # sum of lower triangle (to count unique pairs only)
        total_similarity += np.tril(similarities, -1).sum()
        count += (len(rec_indices) * (len(rec_indices) - 1)) / 2

    ils = total_similarity / count if count > 0 else 0
    return ils
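The function above sums the cosine similarity over every unordered pair of books in each list and divides by the total number of pairs. For a single list of $n$ items this is $\frac{2}{n(n-1)} \sum_{i<j} \cos(v_i, v_j)$. A pure-Python sketch on toy vectors shows the arithmetic; the names cosine and mean_pairwise_similarity are hypothetical and handle one dense list only:

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs in one list."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# toy vectors: two identical items and one orthogonal to both
toy_list = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
toy_ils = mean_pairwise_similarity(toy_list)

# pair similarities are 0, 1 and 0, so the mean is 1/3
print(toy_ils)
```

A higher value means the recommended books resemble one another more closely, which is the opposite direction from the diversity score.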
In [49]:
book_titles = ["The Bell Jar", "Beautiful World, Where Are You", "Jay Vudi"]
recommendation_indices = []

for title in book_titles:
    recs_df = recommend_knn(title, df, knn, combined_weighted_features, n_recommendations=k, verbose=False)
    rec_indices = recs_df.index.to_list()
    recommendation_indices.append(rec_indices)  # Append to the main list

# Calculate ILS across all recommendation lists
ils_score = intra_list_similarity(recommendation_indices, combined_weighted_features)
print("Intra-List Similarity (ILS):", ils_score)
Intra-List Similarity (ILS): 0.45361794864451843
In [50]:
df[df["Title"].str.startswith("Captain Underpants")]["Title"].values
Out[50]:
array(['Captain Underpants and the revolting revenge of the radioactive robo-boxers',
       'Captain Underpants and the Captain Underpants and the Wrath of the Wicked Wedgie Woman'],
      dtype=object)
In [51]:
# children's books
book_titles = ["Peter Pan", 
               "Diary of a Wimpy Kid", 
               "The Jungle Book"]
recommendation_indices = []

for title in book_titles:
    recs_df = recommend_knn(title, df, knn, combined_weighted_features, n_recommendations=k, verbose=False)
    rec_indices = recs_df.index.to_list()
    recommendation_indices.append(rec_indices)

# Calculate ILS across all recommendation lists
ils_score = intra_list_similarity(recommendation_indices, combined_weighted_features)
print("Intra-List Similarity (ILS):", ils_score)
Intra-List Similarity (ILS): 0.5094732627661639

In [52]:
print(df.query("Title == 'Satori in Paris'")["Synopsis"].values[0])
print("\n\n")
print(df.query("Title == 'India My Love'")["Synopsis"].values[0])
This semi-autobiographical tale of Kerouac's own trip to France, to trace his ancestors and explore his own understanding of the Buddhism that came to define his beliefs, contains some of Kerouac's most lyrical descriptions. From his reports of the strangers he meets and the all-night conversations he enjoys in seedy bars in Paris and Brittany, to the moment in a cab he experiences Buddhism's satori - a feeling of sudden awakening - Kerouac's affecting and revolutionary writing transports the reader. Published at the height of his fame, Satori in Parisis a hectic tale of philosophy, identity and the powerful strangeness of travel.



Five Past Midnight in Bhopal and The City of Joy This is the extraordinary story of Dominique Lapierre’s love affair with India, from his first 20,000 kilometre drive across the subcontinent in a veteran Silver Cloud Rolls-Royce gathering unique testimonies for his epic account of India’s independence, to his later encounters with the country’s disinherited and its saints, who taught him a wonderful lesson in sharing and hope and gave birth to the internationally renowned book and film, “The City of Joy”. It is a tale of maharajas and rickshaw pullers, of interviews with Indira Gandhi and the brother of Gandhi’s assassin, of life-changing meetings with Mother Teresa and the victims of the Bhopal disaster, of pig sticking on horseback, of life in the slums with a Swiss nurse, and of saving a home for children affected by leprosy. Above all, it is an insight into how India, with its immense mosaic of people and fascinating culture stole a Frenchman’s heart and turned his life into a testimony to the fact that “All that is not given is lost”.

Export Combined Matrix¶

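npz is a NumPy archive format: a zip file containing one .npy file per array. scipy.sparse.save_npz stores the sparse matrix's component arrays this way and compresses the archive by default.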

In [53]:
from scipy import sparse

sparse.save_npz("out/combined_weighted_feature.npz", combined_weighted_features)
In [54]:
combined_scaled_features = sparse.load_npz("out/combined_weighted_feature.npz")
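A quick way to confirm that an export like this is lossless is a round-trip check. The sketch below uses a small toy matrix and a temporary file rather than the notebook's feature matrix; the file name toy_features.npz is an assumption for illustration.

```python
import os
import tempfile
import zipfile

import numpy as np
from scipy import sparse

# toy sparse matrix standing in for the real feature matrix
toy = sparse.csr_matrix(np.array([[0.0, 1.5, 0.0],
                                  [2.0, 0.0, 0.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_features.npz")
    sparse.save_npz(path, toy)           # compressed by default
    is_zip = zipfile.is_zipfile(path)    # an .npz file is a zip archive
    reloaded = sparse.load_npz(path)

# the element-wise comparison has no stored entries when the matrices match
round_trip_ok = (toy != reloaded).nnz == 0
print(is_zip, round_trip_ok)
```

The same comparison could be applied to the saved and reloaded feature matrices before the file is used elsewhere.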
In [55]:
recommend_knn("The Bell Jar", df, knn, combined_weighted_features, n_recommendations=15)
Title Author Related Genres Subgenres Spectral Cluster
307 The Bell Jar Sylvia Plath [Fiction And Literature, History Biography And... [Poetry and Prose, Classics, Psychology, Femin... 0
316 The Bell Jar Sylvia Plath [Fiction And Literature, History Biography And... [Classics, Psychology, Feminism] 0
Out[55]:
Title Author Related Genres Subgenres Spectral Cluster Distance
230 It Ends With Us Colleen Hoover [Fiction And Literature] [Romance, Contemporary] 0 0.450053
208 The Seven Husbands of Evelyn Hugo Taylor Jenkins Reid [Fiction And Literature] [Adult Fiction, Romance] 0 0.455673
333 The Unabridged Journals of Sylvia Plath Sylvia Plath [Fiction And Literature] [Poetry and Prose, Classics] 0 0.458800
1607 What I Know For Sure Oprah Winfrey [Self Improvement And Relationships, History B... [Self Help, Motivational, Biography, Autobiogr... 0 0.470190
1827 Eat Pray Love Elizabeth Gilbert [History Biography And Social Science, Fiction... [Autobiography, Biography, Memoir, Womens Fict... 0 0.479998
317 Jane Eyre Charlotte Brontë [Fiction And Literature, History Biography And... [Romance, Classics, History] 0 0.489899
937 Upstream Mary Oliver [Fiction And Literature, History Biography And... [Poetry and Prose, Short Story, Memoir] 0 0.495724
1507 Eleven Minutes Paulo Coelho [Fiction And Literature, Spirituality And Phil... [Contemporary, Drama, Romance, Philosophy] 0 0.501274
1588 The Forty Rules of Love Elif Shafak [Fiction And Literature, Spirituality And Phil... [Historical Fiction, Romance, Philosophy] 0 0.502542
1443 Everything I Know about Love Dolly Alderton [Self Improvement And Relationships, History B... [Self Help, Memoir] 0 0.509280
253 There's No Place Like Here Cecelia Ahern [Fiction And Literature] [Mystery, Thriller and Suspense, Fantasy, Wome... 0 0.519018
876 Christmas at Tuppenny Corner Katie Flynn [Fiction And Literature] [Contemporary] 0 0.519239
161 Beach Read Emily Henry [Fiction And Literature] [Womens Fiction, Contemporary, Romance] 0 0.519777
240 Beautiful World, Where Are You Sally Rooney [Fiction And Literature] [Contemporary, Romance] 0 0.522462
311 The Diary of a young girl Anne Frank [History Biography And Social Science, Fiction... [Biography, History, Classics] 0 0.523078